Face Recognition with Deep Learning

Hello, all! I hope the title itself got you excited. What if I told you that building a face recognition system is not that difficult? It really isn't, and it is certainly exciting. Let's build a complete face recognition system that lets you enroll a new candidate into the system and perform recognition with high accuracy!

Before diving into the implementation, let’s get some intuition about how these systems work and how to build one such system for us.

Face recognition pipeline

Almost all face recognition systems work in the way shown below. Please note that the image below is taken from the OpenFace project.

Face recognition pipeline from OpenFace

So, given an image

  1. Preprocess the image
  2. Detect faces
  3. Align faces
  4. Extract face embeddings
  5. Recognize/Classify

Let’s review each of them below.

Preprocess the image

Well, this is the most basic step in any computer vision application, and it goes a long way towards making the application robust. There are many ways to preprocess an image. For face recognition, the lighting conditions should be reasonably good to obtain high accuracy, and of course it is not easy to design an application that adapts to every condition. Preprocessing brings an image to a state that is good enough to proceed with the rest of the pipeline.

For our application, we will use gamma correction to enhance image quality. Another limitation of any face recognition system is its speed. To make the system work in real time, it is always better to downscale the image before passing it on for further processing. Be careful not to downscale too much, though, as it becomes difficult for the system to detect faces in a tiny image.

Let’s open utils.py and start coding.

import cv2
import numpy as np

def enhance_image(image):
    image_YCrCb = cv2.cvtColor(image, cv2.COLOR_BGR2YCR_CB)
    Y, Cr, Cb = cv2.split(image_YCrCb)
    Y = cv2.equalizeHist(Y)
    image_YCrCb = cv2.merge([Y, Cr, Cb])
    image = cv2.cvtColor(image_YCrCb, cv2.COLOR_YCR_CB2BGR)
    return image

def adjust_gamma(image, gamma=1.0):
    invGamma = 1.0 / gamma
    table = np.array([((i / 255.0) ** invGamma) * 255
                      for i in np.arange(0, 256)]).astype("uint8")
    return cv2.LUT(image, table)

The enhance_image function enhances the image by histogram-equalizing the luma component (Y) of the converted image. We convert the image to the YCrCb color space because it gives us separate control over the luminance and chroma of the image.

And we can adjust the gamma of the image by using the adjust_gamma function as shown above.
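To make the gamma mapping concrete, here is a small self-contained sketch (pure NumPy, no OpenCV required) of the same lookup-table math that adjust_gamma uses; the sample values are mine, for illustration only:

```python
import numpy as np

def build_gamma_table(gamma=1.0):
    # Same table adjust_gamma builds: normalize each intensity to [0, 1],
    # raise it to the power 1/gamma, and scale back to [0, 255].
    inv_gamma = 1.0 / gamma
    return np.array([((i / 255.0) ** inv_gamma) * 255
                     for i in range(256)]).astype("uint8")

table = build_gamma_table(gamma=2.0)
# gamma > 1 lifts the mid-tones (128 maps to about 180) while the
# endpoints 0 and 255 stay fixed.
print(table[0], table[128], table[255])
```

cv2.LUT then applies this table to every pixel in a single vectorized pass, which is why the table is precomputed once instead of exponentiating each pixel.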

Detect faces

Face detection sample representation

This is the most important step for the system. There are numerous ways for detecting faces in the image. We will review two of them.

HOG+SVM face detector

Histogram of Oriented Gradients (HOG) is one of the most popular feature extractors in image processing. It was heavily used in early object detection algorithms because it captures the structural features of an object. It works by dividing the image into cells and computing the gradient magnitude and direction within each cell. You can refer to this link for more detailed information on how HOG works.
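As a rough illustration of the quantities HOG is built from (just the per-pixel gradient step using central differences, not dlib's full implementation, which also bins the orientations into cell histograms), consider:

```python
import numpy as np

def gradients(image):
    # Horizontal and vertical derivatives via simple central differences.
    gx = np.zeros_like(image, dtype="float64")
    gy = np.zeros_like(image, dtype="float64")
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]
    gy[1:-1, :] = image[2:, :] - image[:-2, :]
    # Per-pixel gradient magnitude and orientation (degrees, folded to
    # [0, 180) since HOG typically uses unsigned orientations).
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation

# A vertical edge: intensity jumps from 0 to 100 halfway across.
img = np.zeros((8, 8), dtype="float64")
img[:, 4:] = 100.0
mag, ori = gradients(img)
```

HOG would next histogram these orientations inside small cells, weighting each vote by the magnitude, and normalize over blocks of cells to get the final descriptor.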

Preview of Obama's face in HOG

Histogram of Oriented Gradients

The extracted HOG features are then used to train a classifier with the SVM algorithm. At detection time, a window of fixed size is slid over an image pyramid and each region is classified. The classified regions are then passed through non-maximum suppression, which rejects overlapping regions ranked by classification score.
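The non-maximum suppression step can be sketched like this: a minimal NumPy version that greedily keeps the highest-scoring box and drops any box overlapping it too much (dlib uses its own tuned implementation; this is only to show the idea):

```python
import numpy as np

def non_max_suppression(boxes, scores, overlap_thresh=0.5):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) classifier scores.
    order = np.argsort(scores)[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the chosen box with each remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                (boxes[order[1:], 3] - boxes[order[1:], 1]))
        # Drop remaining boxes that overlap the chosen one too much.
        order = order[1:][inter / area <= overlap_thresh]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]])
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # the two overlapping boxes collapse to one
```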

CNN MMOD face detector

Max-Margin Object Detection (MMOD) improves on standard object detection by replacing non-maximum suppression with a new objective function. In non-maximum suppression, overlapping windows that are actually true positives sometimes get rejected. MMOD overcomes this problem by designing an optimizer that runs over all windows and optimizes the performance of the detector in terms of the number of missed detections and false alarms in the final system output. To know more, please read this paper.

A convolutional neural network (CNN) is used as a binary classifier to classify each of the subwindows.

Let’s implement these face detectors in detectors.py

import dlib
import cv2

face_detector = dlib.get_frontal_face_detector()
#face_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def scale_faces(face_rects, down_scale=1.5):
    faces = []
    for face in face_rects:
        scaled_face = dlib.rectangle(int(face.left() * down_scale),
                                     int(face.top() * down_scale),
                                     int(face.right() * down_scale),
                                     int(face.bottom() * down_scale))
        faces.append(scaled_face)
    return faces

def detect_faces(image, down_scale=1.5):
    image_scaled = cv2.resize(image, None, fx=1.0/down_scale, fy=1.0/down_scale,
                              interpolation=cv2.INTER_LINEAR)
    faces = face_detector(image_scaled, 0)
    # faces = [face.rect for face in faces]  # uncomment for the MMOD detector
    faces = scale_faces(faces, down_scale)
    return faces

dlib provides the HOG+SVM face detector through dlib.get_frontal_face_detector, and you can load the pre-trained MMOD face detector using dlib.cnn_face_detection_model_v1 by specifying the path to the pre-trained model. The detect_faces function takes an image and returns the face regions in it. Before detecting faces, it is always better to downsample the image to make detection faster. The scale_faces function then rescales the faces detected in the downscaled image back to the original size.

You should uncomment the line `faces = [face.rect for face in faces]` when using the MMOD face detector, since it returns its face regions as dlib rectangles paired with a confidence score.

if __name__ == "__main__":
    image = cv2.imread("example.jpg")

    faces = detect_faces(image, down_scale=1.5)
    for face in faces:
        x1, y1, x2, y2 = face.left(), face.top(), face.right(), face.bottom()
        cv2.rectangle(image, (x1, y1), (x2, y2), (255, 200, 150), 2, cv2.LINE_AA)

    cv2.imshow("Image", image)
    cv2.waitKey(0)

You can see the face detection results below.

Face detection results on group image

Align faces

If you observe the above image closely, you will see that the faces are not aligned properly, i.e., most of them are either tilted or turned. To achieve high accuracy in face recognition, it is always better to align the faces to some reference template.

One way of doing this is to find the facial landmarks and then transform them to reference coordinates. Again, dlib has pre-trained models for predicting facial landmarks: a 68-landmark model and a 5-landmark model. For the purpose of face recognition, the 5-point predictor is enough, as it is very light and computationally faster.

Comparison between 68 and 5 landmark detectors

68 landmarks vs 5 landmarks

From the above image, we can say that the 5-landmark model is enough for aligning the face to a standard pose. The face alignment itself is handled automatically by dlib as a preprocessing step when extracting the facial embeddings.

Extract face embeddings

Before getting into what exactly face embeddings are, I would like to point out that face recognition is not a classification task. Of course, classification is one way to tackle the problem of face recognition, but that doesn't mean face recognition is only a classification problem.

We humans perceive faces very differently. Our brain processes a face by extracting various features from it. In general we are not classifying between faces; instead, we are comparing the face we see with the faces we have already seen.

Therefore, we formulate face recognition as a problem of extracting features from a face such that the same face yields similar features while different faces yield very different ones. In other words, the idea is to extract a fixed-length vector from a face such that the distance between vectors of the same face is very small and the distance between vectors of different faces is large.

How do we extract such feature vectors? The answer is by training a deep neural network, here a convolutional neural network, with the objective of minimizing the distance between the output feature vectors of the same face and maximizing the distance between those of different faces.
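One common objective with exactly this behavior is the triplet loss popularized by FaceNet (dlib's model is trained with a related metric-learning loss, so treat this as an illustrative sketch rather than dlib's exact objective):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor towards an embedding of the same face (positive)
    # and push it away from a different face (negative), by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same identity, close by
negative = np.array([0.0, 1.0])   # different identity, far away
print(triplet_loss(anchor, positive, negative))  # 0.0: already well separated
```

During training, the network's weights are updated to drive this loss towards zero over many such triplets, which is what makes the final embeddings comparable by plain Euclidean distance.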

To learn more about this technique, read my previous post Building a Reverse Image Search Engine where I’ve given complete details on building and training such systems.

Let’s fire up extractors.py and start coding

import numpy as np

def extract_face_embeddings(image, face_rect, shape_predictor, face_recognizer):
    shape = shape_predictor(image, face_rect)
    face_embedding = face_recognizer.compute_face_descriptor(image, shape)
    face_embedding = [x for x in face_embedding]
    face_embedding = np.array(face_embedding, dtype="float32")[np.newaxis, :]
    return face_embedding

The extract_face_embeddings function takes an image, a dlib face rectangle, a dlib facial landmarks detector, and a face recognizer instance, and extracts a fixed 128-D feature vector. We use a dlib pre-trained model to extract the embeddings.

Now that we have our extractor ready, let's see how we enroll new candidates into the system and save the data to disk.

We include the helper functions in db.py which handles the data for our system.

import numpy as np
import pickle

def add_embeddings(embedding, label, embeddings_path="embeddings.npy",
                   labels_path="labels.pickle"):
    first_time = False
    try:
        embeddings = np.load(embeddings_path)
        with open(labels_path, "rb") as f:
            labels = pickle.load(f)
    except IOError:
        first_time = True

    if first_time:
        embeddings = embedding
        labels = [label]
    else:
        embeddings = np.concatenate([embeddings, embedding], axis=0)
        labels.append(label)

    np.save(embeddings_path, embeddings)
    with open(labels_path, "wb") as f:
        pickle.dump(labels, f)

    return True

The add_embeddings function takes a feature vector and a label, appends them to the stored data, saves the embeddings to a .npy file, and serializes the labels to disk.

Now let’s write a script named enroll.py which enrolls faces to the system.

from extractors import extract_face_embeddings
from detectors import detect_faces
from db import add_embeddings
import dlib

shape_predictor = dlib.shape_predictor("models/shape_predictor_5_face_landmarks.dat")
face_recognizer = dlib.face_recognition_model_v1("models/dlib_face_recognition_resnet_model_v1.dat")

def enroll_face(image, label, embeddings_path="embeddings.npy",
                labels_path="labels.pickle", down_scale=1.0):

    faces = detect_faces(image, down_scale)
    if len(faces) < 1:
        return False
    if len(faces) > 1:
        raise ValueError("Multiple faces not allowed for enrolling")
    face = faces[0]
    face_embeddings = extract_face_embeddings(image, face, shape_predictor,
                                              face_recognizer)
    add_embeddings(face_embeddings, label, embeddings_path=embeddings_path,
                   labels_path=labels_path)
    return True

if __name__ == "__main__":
    import cv2
    import glob
    import argparse

    ap = argparse.ArgumentParser()
    ap.add_argument("-d","--dataset", help="Path to dataset to enroll", required=True)
    ap.add_argument("-e","--embeddings", help="Path to save embeddings",
                    default="embeddings.npy")
    ap.add_argument("-l","--labels", help="Path to save labels",
                    default="labels.pickle")

    args = vars(ap.parse_args())
    filetypes = ["png", "jpg"]
    dataset = args["dataset"].rstrip("/")
    imPaths = []

    for filetype in filetypes:
        imPaths += glob.glob("{}/*/*.{}".format(dataset, filetype))

    for path in imPaths:
        label = path.split("/")[-2]
        image = cv2.imread(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        enroll_face(image, label, embeddings_path=args["embeddings"],
                    labels_path=args["labels"])

enroll.py expects a console argument called dataset, i.e., a path to a directory containing one subdirectory per candidate we want to enroll, named after that candidate. Optionally, you can give paths to save the embeddings and labels. We iterate over the image paths, convert each image to RGB since the trained network expects RGB input, and enroll the faces using the enroll_face function.

Note that each image in the folder should contain a single face, otherwise an exception is raised.


Now that we have extracted face embeddings, we can recognize a face simply by comparing its embedding with the saved face embeddings using a distance metric. Note that we should use the same distance metric that was used in the loss function when training the model. Here, it's the Euclidean distance.

Let’s fire up recognize.py and start writing the final code block.

import numpy as np

def recognize_face(embedding, embeddings, labels, threshold=0.5):
    distances = np.linalg.norm(embeddings - embedding, axis=1)
    argmin = np.argmin(distances)
    minDistance = distances[argmin]

    if minDistance > threshold:
        label = "Unknown"
    else:
        label = labels[argmin]

    return (label, minDistance)

The recognize_face function takes a query embedding, the saved embeddings, and the labels, and finds the index of the minimum distance between the query embedding and the saved embeddings. If that distance is below the given threshold, the corresponding label is returned; otherwise the face is labeled as Unknown.
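Here is a quick toy check of that matching logic, with recognize_face redefined inline on 2-D embeddings so the snippet is self-contained (real embeddings are 128-D; the names and values are mine, for illustration only):

```python
import numpy as np

def recognize_face(embedding, embeddings, labels, threshold=0.5):
    # Nearest neighbour under Euclidean distance, with an "Unknown" cutoff.
    distances = np.linalg.norm(embeddings - embedding, axis=1)
    argmin = np.argmin(distances)
    min_distance = distances[argmin]
    if min_distance > threshold:
        label = "Unknown"
    else:
        label = labels[argmin]
    return (label, min_distance)

embeddings = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = ["alice", "bob"]
print(recognize_face(np.array([[0.1, 0.0]]), embeddings, labels))  # close to "alice"
print(recognize_face(np.array([[5.0, 5.0]]), embeddings, labels))  # too far: "Unknown"
```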

Now let’s continue to write the driver script which takes in an image and performs recognition.

if __name__ == "__main__":
    import cv2
    import argparse
    from detectors import detect_faces
    from extractors import extract_face_embeddings
    import pickle
    import dlib

    ap = argparse.ArgumentParser()
    ap.add_argument("-i","--image", help="Path to image", required=True)
    ap.add_argument("-e","--embeddings", help="Path to saved embeddings",
                    default="embeddings.npy")
    ap.add_argument("-l", "--labels", help="Path to saved labels",
                    default="labels.pickle")
    args = vars(ap.parse_args())

    embeddings = np.load(args["embeddings"])
    with open(args["labels"], "rb") as f:
        labels = pickle.load(f)
    shape_predictor = dlib.shape_predictor("models/"
                                           "shape_predictor_5_face_landmarks.dat")
    face_recognizer = dlib.face_recognition_model_v1("models/"
                                                     "dlib_face_recognition_resnet_model_v1.dat")

    image = cv2.imread(args["image"])
    image_original = image.copy()
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    faces = detect_faces(image)

    for face in faces:
        embedding = extract_face_embeddings(image, face, shape_predictor, face_recognizer)
        label = recognize_face(embedding, embeddings, labels)
        (x1, y1, x2, y2) = face.left(), face.top(), face.right(), face.bottom()
        cv2.rectangle(image_original, (x1, y1), (x2, y2), (255, 120, 120), 2, cv2.LINE_AA)
        cv2.putText(image_original, label[0], (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 0), 2)

    cv2.imshow("Image", image_original)
    cv2.waitKey(0)

We load our pre-trained facial landmarks detector and face recognition model beforehand. Then we read the image given as a console argument and convert it to RGB, since the neural network expects RGB input.

Then we detect the faces using the detect_faces function and iterate over each of them to extract its embedding and recognize it by finding the minimum distance between embeddings. Finally, we draw a rectangle, put the label text on the image, and display it back to the user.

Face recognition results on obama image

Please note that you should download these two files to the models/ directory:

  1. Pre-trained landmarks detector
  2. Pre-trained face embeddings extractor

You can get the complete code from my GitHub repository.

Let’s see the results on some other images.


In this post, we built a complete face recognition pipeline with the help of dlib's pre-trained models. If you want to learn about building these systems from scratch, you can refer to my previous post. In upcoming posts, I'll write about building and training your own YOLO network completely from scratch using TensorFlow and Keras, and about diagnosing CNNs by visualizing what the networks are actually learning.