This post is quite a bit **shorter** than the previous ones, as it simply surveys the **optimization** techniques used to train a neural network. I thought of including an example, but many resources are freely available online, so there is no need to **reinvent** the **wheel**. Coming straight to the point: optimization techniques are those that help our model **converge faster**, make training more efficient, and make our model more **accurate**.

### Why Optimize?

It’s very simple to explain. Imagine you are on a mountain and all you need to do is get down. An easy task, right? Of course, if you know the correct direction to move in and how large each step should be. That is exactly what optimization is. To be more concise, there are two types of optimization methods:

- **First-order methods** (*Jacobian matrix*)
- **Second-order methods** (*Hessian matrix*)

First-order methods calculate the **Jacobian matrix**: the matrix of first derivatives of our function $f$ at a given point *x* in the parameter vector. This gives a direct linear approximation of the function around that point, which means it tells us which direction we should move in.

Second-order algorithms calculate the derivative of the Jacobian (i.e., the derivative of the matrix of derivatives), which is called the **Hessian**. Second-order methods are used for controlling the step size, which is how much an algorithm adjusts the weights as it tries to learn. They are also useful for determining the curvature, i.e., which way your gradient is sloping.

### Momentum

Momentum is a first-order method used to speed up training by pushing the gradient updates in the direction we were already going. It can be achieved by two methods, namely:

- **Regular momentum**
- **Nesterov momentum**

Regular momentum is inspired by classical mechanics, where momentum is the product of mass and velocity; here we define $\mu$, the momentum coefficient, which keeps us moving in the direction we were already going. It can be defined as

$$v(t-1) = \Delta w(t-1)$$

$$\Delta w(t) = \mu v(t-1) - \alpha \nabla J(t)$$

**Nesterov momentum** is a refinement of regular momentum where we look **ahead** before moving and then correct ourselves if we made a **mistake**. The mathematics involved is a bit more complicated, but it is generally better than regular momentum, since we correct ourselves by looking a step ahead.

$$v(t) = \mu v(t-1) - \alpha \nabla J(t)$$

$$\Delta w(t) = -\mu v(t-1) + (1+\mu) v(t)$$
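As a rough illustration, here is a minimal sketch of both updates in plain Python, applied to the toy objective $J(w) = w^2$ (the objective, $\mu = 0.9$, and $\alpha = 0.1$ are my own choices for illustration, not from the post):

```python
def momentum_step(w, v, grad, mu=0.9, alpha=0.1):
    # Regular momentum: v(t) = mu * v(t-1) - alpha * grad J(t)
    v = mu * v - alpha * grad
    return w + v, v

def nesterov_step(w, v, grad, mu=0.9, alpha=0.1):
    # Nesterov in the post's form:
    #   v(t) = mu * v(t-1) - alpha * grad J(t)
    #   dw   = -mu * v(t-1) + (1 + mu) * v(t)
    v_new = mu * v - alpha * grad
    dw = -mu * v + (1 + mu) * v_new
    return w + dw, v_new

# Minimize J(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad=2 * w)
# w is now very close to the minimum at 0
```

Both steppers drive $w$ toward the minimum; Nesterov's look-ahead correction damps the oscillation sooner.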

The image below gives a better intuition on momentum updates.

### Adaptive learning rate

This falls under the **second-order** style of methods in that it adjusts the *learning rate* automatically rather than requiring us to define it manually. If the learning rate is held constant, the gradient updates may end up oscillating rather than converging, so we need to decrease the learning rate as training proceeds for better convergence. Several techniques are available to achieve an adaptive learning rate.

- Step Decay
- Exponential Decay
- $\frac{1}{t}$ decay
- AdaGrad
- AdaDelta
- RMSprop
- Adam

**Step decay:**

In step decay we reduce the learning rate by a factor after a certain number of steps. So, for example, every 10 steps we can set `learning_rate /= k`, where $k$ is the scaling factor.

**Exponential decay**

In this method we reduce the `learning_rate` exponentially, such that $learning\_rate = learning\_rate \cdot e^{-\lambda t}$, where $\lambda$ is the scaling factor and $t$ is the iteration number.

**$\frac{1}{t}$ decay**

Here the learning rate is adjusted as follows.

$learning\_rate = learning\_rate/(1+kt)$
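The three schedules above can be sketched in a few lines of plain Python (the initial rate and the $k$, $\lambda$, and drop-interval values below are arbitrary choices for illustration):

```python
import math

def step_decay(lr0, t, k=2.0, drop_every=10):
    # Divide the learning rate by k once every `drop_every` steps
    return lr0 / (k ** (t // drop_every))

def exponential_decay(lr0, t, lam=0.1):
    # lr = lr0 * e^(-lambda * t)
    return lr0 * math.exp(-lam * t)

def one_over_t_decay(lr0, t, k=0.1):
    # lr = lr0 / (1 + k * t)
    return lr0 / (1 + k * t)

for t in (0, 10, 50):
    print(step_decay(0.1, t), exponential_decay(0.1, t), one_over_t_decay(0.1, t))
```

All three shrink the rate over time; they differ mainly in how fast the shrinking happens.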

**AdaGrad**

This is a better adaptive method than the above, since it is based on the weight changes made so far. It keeps track of a **cache**, the accumulated sum of squared gradients, which allows the updates to adapt to the **weight changes**. Therefore

$$cache = cache + gradient^2$$

$$w = w - learning\_rate \cdot \frac{gradient}{\sqrt{cache}+\epsilon}$$

$\epsilon$ is introduced so that we do not divide by zero.
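A minimal sketch of those two formulas in plain Python, again on the toy objective $J(w) = w^2$ (objective and learning rate are illustrative choices):

```python
def adagrad_step(w, cache, grad, lr=0.5, eps=1e-8):
    # Accumulate squared gradients, then scale the step down as they grow
    cache = cache + grad ** 2
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

# Minimize J(w) = w^2 (gradient 2w): steps shrink as the cache grows
w, cache = 5.0, 0.0
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=2 * w)
# w has moved close to the minimum at 0
```

Note that because the cache only ever grows, the effective step size decays monotonically, which is exactly the aggressiveness AdaDelta tries to fix.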

**AdaDelta**

Adadelta is an extension of *Adagrad* that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size. Therefore the only change is that the gradients involved are the $k$ latest gradients instead of all the gradients so far.
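A naive sketch of that windowed idea in plain Python; note that the published Adadelta actually approximates the window with an exponentially decaying average (and a second average over the updates themselves) rather than literally storing the $k$ latest gradients, so this is only the literal version of the description above, with illustrative constants:

```python
from collections import deque

def windowed_step(w, window, grad, lr=0.5, eps=1e-8):
    # cache = sum over only the k most recent squared gradients
    window.append(grad ** 2)      # deque(maxlen=k) drops the oldest entry
    cache = sum(window)
    return w - lr * grad / (cache ** 0.5 + eps)

w, window = 5.0, deque(maxlen=10)  # k = 10 latest gradients
for _ in range(200):
    w = windowed_step(w, window, grad=2 * w)
# w hovers near the minimum at 0 instead of stalling as AdaGrad can
```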

**RMSprop**

This technique was introduced by *Geoff Hinton*, and it adds **decay** to the cache itself. It is similar to AdaDelta, the only change being that the cache itself decays, which increases the chances of converging to the global minimum faster and more efficiently.

$$cache = decay\_rate * cache + (1-decay\_rate)*grad^2$$

$$w = w - learning\_rate \cdot \frac{gradient}{\sqrt{cache}+\epsilon}$$
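Following those two formulas, a minimal sketch in plain Python (the decay rate, learning rate, and toy objective are illustrative choices):

```python
def rmsprop_step(w, cache, grad, lr=0.1, decay_rate=0.9, eps=1e-8):
    # The cache itself decays instead of growing without bound
    cache = decay_rate * cache + (1 - decay_rate) * grad ** 2
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

# Minimize J(w) = w^2 (gradient 2w)
w, cache = 5.0, 0.0
for _ in range(300):
    w, cache = rmsprop_step(w, cache, grad=2 * w)
# w ends up near the minimum at 0
```

Because the cache forgets old gradients, the effective step size no longer shrinks to zero the way AdaGrad's does.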

**Adam**

It is similar to *AdaDelta* and *RMSprop*: it keeps track of averages of past squared gradients as AdaDelta and RMSprop do, and it also tracks an exponentially decaying average of the past gradients themselves, which makes gradient descent more accurate. These exponentially decaying averages are tuned by the parameters $\beta_1$ and $\beta_2$.
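A minimal sketch of the Adam update in plain Python, including the bias-correction terms from the original paper (the learning rate and toy objective are illustrative choices):

```python
def adam_step(w, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: decaying average of gradients; v: decaying average of squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Minimize J(w) = w^2 (gradient 2w); t starts at 1 for bias correction
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
# w settles near the minimum at 0
```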

### Where to move on now?

As I said before, the main objective of this blog is to give an **intuitive** idea of things that are commonly difficult to understand in the fields of machine learning, deep learning, and computer vision. I prefer not to *reinvent* *the* *wheel*. Have a look at the following links to get started in deep learning and computer vision, then come back here, as the posts from now on will be full projects that a beginner may find difficult to start on. I’ll try my best to help you understand things, but it is better to have a look at these links before coming back to this blog.

- Tensorflow tutorials (Hvass Labs)
- Official theano tutorials
- Keras Documentation
- Convolution Neural Networks (cs231 Stanford)
- FastAI course
- Natural Language Processing (University of Michigan)
- Pyimagesearch (highly recommended for computer vision beginners)

### What will be covered in the next blog posts?

I’m glad you asked! As said before, I will start writing up projects (in parts) that use deep learning, computer vision, and sometimes natural language processing. You may have noticed that this post came long after the previous one; I’ve spent that time coding projects to post on this blog. These are the ones I’ve finished coding and am ready to post:

- **Handwritten digits recognition** – Takes an image and recognizes the digits in it.
- **Sudoku solver (9x9)** – Captures the puzzle and solves it; uses computer vision and deep learning for digit recognition. An extension of the above.
- **Facial expression detection** – In images and live streaming video.
- **Custom Object Detection** – From data collection and annotation to detection.
- **Object tracker with occlusions reduced** – Mean Shift, CamShift, and Lucas-Kanade.
- **A fashion image search engine** – Similar to what we see in the fashion category on e-commerce sites.
- **Object Recognition** – We use Caltech-10 and Caltech-101 for this purpose, and I will also show you how to use pre-trained models.
- **Aspect-based sentiment analysis** – From collecting feature words to detecting aspects and extracting sentiments (NLP).
- **Fake Review Detector** – From data collection to detection (NLP).

These are a few for which I’ve completed the coding, and there will be more. Remember that I’ll also be posting simple one-page posts in between, which can be considered short and exciting tweaks or utilities.

I’m planning to implement **RCNN** and **Deep Ranking** (http://users.eecs.northwestern.edu/%7Ejwa368/pdfs/deep_ranking.pdf) completely from scratch in **Keras** or **Caffe**. Many more to come.

Thank you, have a nice day…