Multi Layer Perceptron

I know, you probably think the Multi Layer Perceptron sounds very similar to the Perceptron. And yes, it is, but wait! It's not just that; there is a lot more going on under the hood. To be frank, the underlying concepts of neural networks are somewhat difficult to grasp at the beginning, but learning how they really work saves you from many problems you may face later.

As I said, the Multi Layer Perceptron is a very large topic to cover in a single post. I know there are many blogs out there that explain it in one post, but trust me, you will get a lot more here, a full-course meal. To completely understand how these networks work, just follow along as this series unfolds.

What is Multi Layer Perceptron?

Okay! It is similar to the Perceptron, but instead of a single layer we have multiple layers. That's it? No, there is more. First, have a look at it.

3 layer neural network

So I guess you have an idea now. We can see that it consists of 3 layers, namely the Input layer, the Hidden layer, and the Output layer. It is a very simple neural network; we can increase the number of layers based on the requirement, but for now let's stick to the basic 3-layer setup to better understand how it works.

What are these layers?

The Input layer takes in the input and the Output layer produces the result, that's it. But with the multi-layer approach things get more interesting, and the reason is the Hidden layer. We can have as many hidden layers as we like, but only one input layer and one output layer, as you might have already figured out.

How do we predict/classify?

Remember the Perceptron? Yeah!! We proceed the same way here, just a bit deeper since there is one more layer. We have the weights $W_1$ and $W_2$ associated with the layers. Suppose our input is of dimension $N \times D$, where $N$ is the number of samples and $D$ is the number of features, our output can be classified into $K$ classes, and the number of hidden units is $M$. Then we have the shapes as follows:

$X \in (N \times D)$

$W_1 \in (D \times M)$

$W_2 \in (M \times K)$

$Y \in (N \times K)$

To predict the output we use the feedforward method, which is simply a bottom-up pass through the layers, just as with the perceptron. Therefore,

$ Z_1 = X \cdot W_1$

$ a_1 = g(Z_1)$

$Z_2 = a_1 \cdot W_2$

$a_2 = g(Z_2)$

where $g$ is the activation function, most commonly the sigmoid, i.e.,

$$ g(z) = \frac{1}{1+e^{-z}}$$

And that's it, the output we need is right there: it is just $a_2$.
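To make the shapes concrete, here is a minimal NumPy sketch of the feedforward pass. The names sigmoid, feedforward, W1 and W2 are my own for illustration, and biases are left out to match the equations above.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(X, W1, W2):
    # X: (N x D), W1: (D x M), W2: (M x K)
    Z1 = X.dot(W1)        # (N x M)
    a1 = sigmoid(Z1)      # hidden layer activations
    Z2 = a1.dot(W2)       # (N x K)
    a2 = sigmoid(Z2)      # network output
    return Z1, a1, Z2, a2

# Example: N=4 samples, D=3 features, M=5 hidden units, K=2 classes
rng = np.random.default_rng(0)
X  = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
Z1, a1, Z2, a2 = feedforward(X, W1, W2)
print(a2.shape)   # (4, 2)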

How do we learn the weights?

I'm glad you asked. This is not as easy as it was with the perceptron, although the basic idea is the same; the difficulty comes from having multiple layers now. So we need a method called Backpropagation, which made training neural networks a lot easier.

Backpropagation?

It is the way we calculate the gradients needed to learn the weights. Assume we have a cost function $J$ that is to be minimized with respect to $W_1$ and $W_2$. Since the network is layered, it is hard to handle everything at once, so we apply backpropagation, which simply follows the chain rule, i.e., "Never stop until you reach the target".

Therefore, with $a_2$ as the output and $Y$ as the true labels, the cost function is defined as follows.

$$J = \frac{1}{2N}\sum (a_2-Y)^2$$

and we need to find the $W_1$ and $W_2$ that minimize the cost $J$, for which we need the gradients $\frac{\partial J}{\partial W_1}$ and $\frac{\partial J}{\partial W_2}$.

We need to apply chain rule to find the gradients.

$$ \frac{\partial}{\partial W_2}J = \frac{\partial J}{\partial a_2}  \frac{\partial a_2}{\partial Z_2}  \frac{\partial Z_2}{\partial W_2}$$

Simplifying,

$$\frac{\partial J}{\partial a_2} =  \frac{\partial}{\partial a_2} \frac{1}{2N} \sum (a_2-Y)^2 = (a_2 - Y)$$

(the constant factor $\frac{1}{N}$ is dropped here, since it can be absorbed into the learning rate)

$$ \frac{\partial a_2}{\partial Z_2} = \frac{\partial }{\partial Z_2}g(Z_2) = g'(Z_2)$$

in case of sigmoid,

$$ g'(Z_2) = a_2(1- a_2)$$

and,

$$ \frac{\partial Z_2}{\partial W_2} = \frac{\partial}{\partial W_2} (a_1 \cdot W_2) = a_1$$

combining all these,

$$ \frac{\partial}{\partial W_2} J ={a_1}^T \cdot (a_2 -Y)a_2(1-a_2)$$

which can be rewritten as,

$$\frac{\partial}{\partial W_2} J = {a_1}^T \cdot \delta_2 $$

where $ \delta_2 = (a_2 -Y)a_2(1-a_2)$

The same can be done for the $W_1$ gradient using the chain rule. Therefore,

$$ \frac{\partial}{\partial W_1}J = \frac{\partial J}{\partial a_2}  \frac{\partial a_2}{\partial Z_2}  \frac{\partial Z_2}{\partial a_1}\frac{\partial a_1}{\partial Z_1}\frac{\partial Z_1}{\partial W_1}$$

We can observe a common factor in both $\frac{\partial J}{\partial W_2}$ and $\frac{\partial J}{\partial W_1}$, namely $\frac{\partial J}{\partial a_2}  \frac{\partial a_2}{\partial Z_2}$, which is exactly $\delta_2$. Setting it aside, we can derive the remaining terms as follows.

$\frac{\partial Z_2}{\partial a_1} = W_2$

$\frac{\partial a_1}{\partial Z_1} = g'(Z_1) = a_1(1-a_1)$

$\frac{\partial Z_1}{\partial W_1} = X$

So finally we can combine all them as shown below.

$$ \frac{\partial}{\partial W_1}J = X^T \cdot \big[\big((a_2-Y)a_2(1-a_2) \cdot W_2^T\big)\, a_1(1-a_1)\big]$$

The terms are arranged in this order so that the result has the same dimensions as $W_1$.

And the above equation can be rewritten as,

$$\frac{\partial}{\partial W_1}J = X^T \cdot \delta_1$$

where $\delta_1 = (\delta_2 \cdot W_2^T)\, g'(Z_1)$

That's all there is to backpropagation, it is as simple as that. All you need to remember is: "Never stop until you reach the target".
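Translating the chain-rule steps above into NumPy gives a short sketch of the gradient computation, reusing the hypothetical sigmoid and feedforward helpers from earlier; * denotes the element-wise product.

def backprop(X, Y, W1, W2):
    # Forward pass first
    Z1, a1, Z2, a2 = feedforward(X, W1, W2)

    # delta_2 = (a2 - Y) * g'(Z2), where g'(Z2) = a2 * (1 - a2) for the sigmoid
    delta2 = (a2 - Y) * a2 * (1.0 - a2)           # (N x K)

    # delta_1 = (delta_2 . W2^T) * g'(Z1)
    delta1 = delta2.dot(W2.T) * a1 * (1.0 - a1)   # (N x M)

    # Gradients, with the same shapes as W2 (M x K) and W1 (D x M)
    dW2 = a1.T.dot(delta2)
    dW1 = X.T.dot(delta1)
    return dW1, dW2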

How to update weights?

Oh! It's the same way we did it before. We simply subtract the gradients, scaled by the learning rate, from the weights, as simple as that. Therefore,

$W_m = W_m - \alpha \frac{\partial J}{\partial W_m}$

where $\alpha$ is the learning rate.
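In code the update is one line per weight matrix. Continuing the earlier sketch, with toy one-hot labels Y added just for illustration:

# toy one-hot labels for the N=4, K=2 example above
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

alpha = 0.1                            # learning rate
dW1, dW2 = backprop(X, Y, W1, W2)      # gradients from the sketch above
W1 = W1 - alpha * dW1
W2 = W2 - alpha * dW2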

Is that all?

Unfortunately, no. There is more to discuss, and I'm sorry for the very long post. But as I said earlier, this post is just one part of what we intend to learn. So hold on, take a break, relax and come back.

Yes! We need to look at the different ways to train our neural network. The one we implemented earlier is only one among them. So what are they?

  • Gradient Descent (one we have seen earlier)
  • Stochastic Gradient Descent (SGD)
  • Batch Gradient Descent

Gradient Descent?

Gradient descent is just what we saw with the perceptron; the training happens as described below.

for each_iteration in total_iterations{
    
    feedforward the input (all samples)
    find the gradients
    update the weights

}

That means we update the weights only after processing all the inputs, and we repeat this process until a stopping criterion is met, in this case the number of iterations. It is pretty slow, since every single update has to process all the inputs, so it takes more time to converge.
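A runnable version of this loop could look like the sketch below, reusing the hypothetical backprop helper from earlier.

def train_gradient_descent(X, Y, W1, W2, alpha=0.1, iterations=1000):
    for _ in range(iterations):
        # one update per pass over all the samples
        dW1, dW2 = backprop(X, Y, W1, W2)
        W1 -= alpha * dW1
        W2 -= alpha * dW2
    return W1, W2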

Stochastic Gradient Descent?

Well! It is nothing but plain gradient descent, the only change being that we update our weights after processing each sample rather than after all the samples. Therefore,

randomly shuffle the samples

for each_sample in total_samples{

    feedforward the input
    find the gradients
    update the weights

}

Remember to shuffle the samples randomly, as it may help the cost function converge faster. Unlike Gradient Descent, the cost here does not decrease steadily; it fluctuates, but note that SGD is much faster than plain gradient descent and it also allows online learning.
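As a sketch, again reusing the hypothetical helpers from earlier, stochastic gradient descent processes one shuffled sample at a time:

def train_sgd(X, Y, W1, W2, alpha=0.1, epochs=10):
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)   # shuffle the samples each epoch
        for i in order:
            x_i = X[i:i+1]                 # keep the 2-D shape (1 x D)
            y_i = Y[i:i+1]                 # (1 x K)
            dW1, dW2 = backprop(x_i, y_i, W1, W2)
            W1 -= alpha * dW1
            W2 -= alpha * dW2
    return W1, W2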

stochastic vs batch gradient descent

As you can see from the above image, gradient descent converges more smoothly than stochastic gradient descent. Most of the time we run stochastic gradient descent for several iterations (epochs) so that it settles close to a minimum instead of fluctuating around it. Although SGD is fast, it may keep oscillating near the minimum, whereas plain gradient descent approaches it more steadily over the iterations.

Batch Gradient Descent?

Okay! It is a hybrid of Gradient Descent and Stochastic Gradient Descent, often called mini-batch gradient descent. All we do here is split the samples into batches and run a gradient-descent update on each batch. So it behaves like gradient descent within a batch, while drawing the batches without replacement from the shuffled samples gives it the flavour of stochastic gradient descent. Therefore,

randomly shuffle the samples
split the samples into batches

for each_batch in total_batches{
    feedforward the batch
    find the gradients 
    update weights
}

While a single pass of the above process can already get us close to a minimum fairly quickly, it is recommended to repeat it for a certain number of iterations (epochs) for more accurate results.
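A sketch of the batched version, with a hypothetical batch_size parameter and the helpers from earlier:

def train_minibatch(X, Y, W1, W2, alpha=0.1, epochs=10, batch_size=32):
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)           # shuffle once per epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]  # one batch, without replacement
            dW1, dW2 = backprop(X[idx], Y[idx], W1, W2)
            W1 -= alpha * dW1
            W2 -= alpha * dW2
    return W1, W2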

SGD vs gradient descent

The above image is enough to explain the difference between gradient descent and stochastic gradient descent. Although it shows that SGD has not yet converged, it may do so after several more iterations. Also note that SGD drops very fast at the beginning and then starts to fluctuate a lot, so to make training more robust, mini-batch gradient descent is preferable.

Anything more?

Yeah! Let me introduce one more term, Softmax. Softmax plays a role similar to our sigmoid function, but it is applied at the output layer to turn the scores into class probabilities in classification problems. The sigmoid does not give us a proper probability distribution over the classes, so Softmax is preferable whenever the task is classification with more than 2 classes. Softmax is defined as below.

$$Softmax(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}$$

Looking at the above equation, we can deduce that $\sum_{j=1}^{K}Softmax(z)_j = 1$, which is exactly what we require of a probability distribution.
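A minimal NumPy sketch of softmax; the function name is my own, and subtracting the row-wise maximum is only for numerical stability and does not change the result.

def softmax(Z):
    # Z: (N x K) scores, returns row-wise probabilities that sum to 1
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.1]]))
print(probs, probs.sum())                      # the row sums to 1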

And we also need to discuss Cross Entropy.

Cross Entropy?

Don't get scared by the name. It's just another way of computing the cost. So far we have been using the least-squares error to find the cost, but that is not a good fit for classification problems, so cross entropy is introduced. It is nothing but the logistic loss. Therefore,

$$J = -\frac{1}{N}\sum_{n=1}^{N}\big[y_n \log (\bar y_n) + (1-y_n)\log (1-\bar y_n)\big]$$

where $y$ and $\bar y$ are the true and predicted outputs.
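As a rough sketch, the cost above can be computed like this; the cross_entropy name and the small eps clip to avoid log(0) are my own additions.

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y: true labels, Y_hat: predicted probabilities, both of shape (N x K)
    N = Y.shape[0]
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)     # avoid log(0)
    return -np.sum(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat)) / N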

Remember that we don't need to redo the $\frac{\partial J}{\partial W}$ derivations from scratch: with a sigmoid or softmax output, the derivative of the cross entropy simplifies to the familiar $(a_2 - Y)$ term, so the backpropagation machinery stays the same. Just note that using cross entropy for classification problems is a good idea.

Okay, drive me to code!!

Not so fast, at least I'm tired. We will plug the concepts we have learned in this post into building our own neural network from scratch, and then we will explore Keras, a nice library for building neural networks more easily. In the next posts we shall see how to apply this knowledge to recognize handwritten digits, and later extend it to facial expression recognition, both in images and in live streaming video.