Optimizing Neural Nets

This post is shorter than previous ones, as it simply walks through the optimization techniques used to train a neural network. I thought of including an example, but many free resources are available online, so there is no need to reinvent the wheel. Coming straight to the point: optimization techniques are what help our model converge faster, work more efficiently, and become more accurate.

Why Optimize?

It’s very simple to explain. Imagine you are on a mountain and all you need is to get down. That’s an easy task, right? Of course it is, if you know the correct direction to move in and how big each step should be. And that is exactly what optimization does. To be more concise, there are two types of optimization methods:

  • First Order methods (Jacobian matrix)
  • Second Order methods (Hessian matrix)

First-order methods work with the Jacobian matrix: the matrix of first partial derivatives of our function $f$ with respect to the parameters. Evaluated at a given point $x$ in the parameter vector, it gives a linear approximation of $f$ around that point. In other words, it tells us which direction we should move in.

Second-order algorithms calculate the derivative of the Jacobian (i.e., the derivative of the matrix of derivatives), which is called the Hessian. Second-order methods are used for controlling the step size, which is how much an algorithm adjusts the weights as it tries to learn. They are also useful for determining how the loss surface is curving, i.e., which way your gradient is sloping.


Momentum

Momentum is a first-order method used to speed up training by pushing the gradient updates in the direction we were already going. It can be achieved by two methods, namely:

  • Regular momentum
  • Nesterov momentum

Regular momentum is inspired by classical mechanics, where momentum is the product of mass and velocity. Here we define a momentum coefficient $\mu$ that keeps us moving in the direction we were already going. It can be defined as

$$v(t-1) = \Delta w(t-1)$$

$$\Delta w(t) = \mu v(t-1) - \alpha \nabla J(t)$$
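As a minimal sketch of the update above, here it is in NumPy. The quadratic loss $J(w) = w^2$ and all hyperparameter values are illustrative assumptions, not anything prescribed by the method:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # v(t) = mu * v(t-1) - alpha * grad J(t); the weight change is v(t)
    v = mu * v - lr * grad
    w = w + v
    return w, v

# toy example: minimize J(w) = w^2, whose gradient is 2w
w, v = np.array([5.0]), np.array([0.0])
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w)
```

Notice that $v$ carries the previous weight change forward, so consecutive steps in the same direction build up speed.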

Nesterov momentum is an improvement on regular momentum where we look a step ahead in the direction we are moving and then correct ourselves if we made a mistake. The mathematics involved is a bit more complicated, but take it for granted that it generally works better than regular momentum, since we correct ourselves by looking a step ahead.

$$v(t) = \mu v(t-1) - \alpha \nabla J(t)$$

$$\Delta w(t) = -\mu v(t-1) + (1+\mu) v(t)$$
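The same toy setup can be used to sketch the Nesterov form of the equations above (again, the quadratic loss and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    v_prev = v
    # v(t) = mu * v(t-1) - alpha * grad J(t)
    v = mu * v - lr * grad
    # delta w(t) = -mu * v(t-1) + (1 + mu) * v(t)
    w = w + (-mu * v_prev + (1 + mu) * v)
    return w, v

# toy example: minimize J(w) = w^2, whose gradient is 2w
w, v = np.array([5.0]), np.array([0.0])
for _ in range(300):
    w, v = nesterov_step(w, v, 2 * w)
```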

The image below gives a better intuition on momentum updates.


Adaptive learning rate

This comes under the second-order methods: the learning rate is adjusted automatically rather than defined manually by ourselves. If the learning rate is held constant, the updates may end up fluctuating around the minimum rather than converging, so we decrease the learning rate as training proceeds so that it converges better. Several techniques are available to help achieve an adaptive learning rate:

  • Step Decay
  • Exponential Decay
  • $\frac{1}{t}$ decay
  • AdaGrad
  • AdaDelta
  • RMSprop
  • Adam

Step decay

In step decay we reduce the learning rate by a factor after a certain number of steps. For example, every 10 steps we can apply learning_rate /= k, where $k$ is the scaling factor.
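A minimal sketch of this schedule, where `drop_every` and the factor $k$ are illustrative choices:

```python
def step_decay(initial_lr, step, drop_every=10, k=2.0):
    # divide the learning rate by k once every `drop_every` steps
    return initial_lr / (k ** (step // drop_every))

step_decay(0.1, 0)   # no drops yet: 0.1
step_decay(0.1, 25)  # dropped twice: 0.1 / 4 = 0.025
```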

Exponential decay

In this method we reduce the learning rate exponentially, such that $learning\_rate = learning\_rate \cdot e^{-\lambda t}$, where $\lambda$ is the scaling factor and $t$ is the iteration number.
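In code, this schedule could be sketched as follows (the $\lambda$ value is an illustrative assumption):

```python
import math

def exponential_decay(initial_lr, t, lam=0.1):
    # lr(t) = lr0 * exp(-lambda * t)
    return initial_lr * math.exp(-lam * t)
```

At $t = 0$ this returns the initial rate, and it shrinks smoothly as $t$ grows.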

$\frac{1}{t}$ decay

Here the learning rate is adjusted as follows.

$learning\_rate = learning\_rate/(1+kt)$
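A quick sketch of this schedule (the $k$ value is an illustrative assumption):

```python
def one_over_t_decay(initial_lr, t, k=0.5):
    # lr(t) = lr0 / (1 + k * t)
    return initial_lr / (1 + k * t)
```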


AdaGrad

AdaGrad is a better adaptive method than the schedules above, because it adapts based on the weight changes made so far. It keeps track of a cache, the accumulated squared gradients, which allows the step size to adapt with the weight changes. Therefore

$$cache = cache + gradient^2$$

$$w = w - learning\_rate \cdot \frac{gradient}{\sqrt{cache}+\epsilon}$$

$\epsilon$ is introduced so that we do not divide by zero.
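Putting the two equations above together, one step of this cache-based update could be sketched as follows (the learning rate here is an illustrative choice):

```python
import math

def adagrad_step(w, cache, grad, lr=0.1, eps=1e-8):
    # accumulate the squared gradient, then scale the step by 1/sqrt(cache)
    cache = cache + grad ** 2
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache

# one update from w = 1 with gradient 1: cache becomes 1, step is ~lr
w, cache = adagrad_step(w=1.0, cache=0.0, grad=1.0)
```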


AdaDelta

AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size. Therefore the only change is that the cache is built from the $k$ latest gradients instead of all the gradients so far.
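A sketch of the windowed idea described here, using a fixed-size `deque` so that only the $k$ latest squared gradients survive in the cache. (Note this follows the simplified description above; the published AdaDelta algorithm actually uses exponentially decaying averages rather than a hard window, plus a unit-correction term.)

```python
import math
from collections import deque

def windowed_adagrad_step(w, recent, grad, lr=0.1, eps=1e-8):
    # `recent` is a deque(maxlen=k): appending drops the oldest entry,
    # so the cache sums only the k latest squared gradients
    recent.append(grad ** 2)
    cache = sum(recent)
    return w - lr * grad / (math.sqrt(cache) + eps)

recent = deque(maxlen=3)  # k = 3 latest gradients
w = 1.0
for g in [2.0, 2.0, 2.0, 2.0]:  # the 4th squared gradient evicts the 1st
    w = windowed_adagrad_step(w, recent, g)
```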


RMSprop

This technique, introduced by Geoff Hinton, adds decay to the cache itself. It is similar to AdaDelta, the only change being that the cache itself decays, which improves the chances of converging to the global minimum faster and more efficiently.

$$cache = decay\_rate * cache + (1-decay\_rate)*grad^2$$

$$w = w - learning\_rate \cdot \frac{gradient}{\sqrt{cache}+\epsilon}$$
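One step of this decaying-cache update could be sketched as follows (the hyperparameter values are illustrative; 0.9 is a commonly quoted decay rate):

```python
import math

def rmsprop_step(w, cache, grad, lr=0.01, decay_rate=0.9, eps=1e-8):
    # the cache is a decaying average, weighted toward recent squared gradients
    cache = decay_rate * cache + (1 - decay_rate) * grad ** 2
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache

# one update from w = 1 with gradient 1: cache becomes 0.1
w, cache = rmsprop_step(w=1.0, cache=0.0, grad=1.0)
```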


Adam

Adam is similar to AdaDelta and RMSprop. Like them, it keeps an exponentially decaying average of past squared gradients, but it additionally keeps a decaying average of the past gradients themselves, which makes gradient descent more accurate. The decay of these averages is tuned by the parameters $\beta_1$ and $\beta_2$.
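A sketch of one update step, with the two decaying averages and the standard bias-correction terms (the hyperparameter values are the commonly cited defaults, used here purely for illustration):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: decaying average of past gradients; v: of past squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction: both averages start at zero, so rescale early steps
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# first update (t = 1) from w = 0 with gradient 1
w, m, v = adam_step(w=0.0, m=0.0, v=0.0, grad=1.0, t=1)
```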

Where to move on now?

As I said before, the main objective of this blog is to give an intuitive idea of things that are commonly difficult to understand in machine learning, deep learning, and computer vision. I prefer not to reinvent the wheel, so head over to the following links to get started with deep learning and computer vision, then come back here. The posts from now on will be full projects that a beginner may find difficult to start on. I’ll try my best to help you understand things, but it is better to have a look at these links before coming back to this blog.

What will be covered in the next blog posts?

I’m glad you asked! As said before, I will start writing up projects (in parts) that use deep learning, computer vision, and sometimes natural language processing. You may have noticed that this post comes much later than the previous one; I spent that time coding some projects to post on this blog. These are the few I’ve finished coding that are ready to post:

  • Handwritten digits recognition – Takes an image and recognizes the digits in it.
  • Sudoku solver (9x9) – Captures the puzzle and solves it, using computer vision and deep learning for digit recognition. An extension of the above.
  • Facial expression detection: In images and live streaming video.
  • Custom Object Detection – From data collection and annotating to detecting.
  • Object tracker with reduced occlusion problems – Mean Shift, CamShift, and Lucas-Kanade
  • A fashion image search engine – Similar to what we see in the fashion category on e-commerce sites.
  • Object Recognition – We use Caltech-10 and Caltech-101 for this purpose and will also show you how to use pre-trained models.
  • Aspect-based sentiment analysis – From collecting feature words to detecting aspects and extracting sentiments (NLP)
  • Fake Review Detector – From data collection to detecting (NLP)

These are a few for which I’ve completed the coding, and there will be more. Remember that I’ll be posting simple one-page posts in between, which can be considered tweaks or utilities: short and exciting.

I’m planning to implement R-CNN and Deep Ranking (http://users.eecs.northwestern.edu/%7Ejwa368/pdfs/deep_ranking.pdf) completely from scratch in Keras or Caffe. Many more to come.

Thank you, have a nice day…