RMSprop: Saviour of Adagrad

Vinrawat · Jun 27, 2021

We know how an optimizer helps us reach the global minimum faster, and how one optimizer relates to another, since each has its own advantages and disadvantages. Let us look at the optimizer that resolves the problem of Adagrad.

Let’s first understand what problem Adagrad was solving:

We know that every problem is different from the others; some have a complex loss surface.

In this figure, which shows a top-down view of a complex 3-D loss surface, we can see that if the weights are initialised at point ‘b’, the gradient moves toward the minimum in a straight line. (The gradient is always perpendicular to the contour lines and points in the direction of steepest ascent, so we attach a negative sign to descend.) But from point ‘a’, the gradient has to change direction at every step to reach the global minimum. Adagrad resolves this problem by adapting the step size for each parameter as the direction changes, something that earlier optimizers like Momentum and Nesterov did not do.

Let us understand mathematically how it does that.
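A standard way to write the Adagrad update is the following (notation assumed here: $g_t$ is the gradient at step $t$, $\eta$ the learning rate, and $\epsilon$ a small constant for numerical stability):

$$v_t = v_{t-1} + g_t^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t$$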

Looking at the equation, we can see that as the epochs increase, the effective learning rate decays: the denominator accumulates the squares of all previous gradients, so it only ever grows, and the update becomes smaller and smaller. For example, if the gradients stay around 1, after 10,000 steps the denominator is roughly √10000 = 100, so each step is about 100 times smaller. This can make training stop too early, before reaching the global minimum.

To solve this problem, RMSprop, introduced by Geoff Hinton, is used. In this method we give more weight to the most recent gradients than to older ones, using an exponentially decaying average of the squared gradients.

Let’s understand this mathematically:
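RMSprop replaces Adagrad’s running sum with an exponentially weighted moving average. A standard form, using the same notation as above plus a decay rate $\beta$ (commonly around 0.9), is:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t$$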

From the above we can see that we introduce a beta term, which controls how much weight we take from the previous accumulated gradient versus the current one. Because old squared gradients decay away instead of piling up forever, the denominator no longer grows without bound, and this resolves the problem of Adagrad.
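Here is a minimal sketch of the two update rules in plain Python (the hyperparameter values eta, beta, and eps are illustrative assumptions, not values prescribed by either method):

```python
import math

def adagrad_step(w, grad, v, eta=0.01, eps=1e-8):
    # Adagrad: accumulate ALL past squared gradients.
    # v only grows, so the effective step eta/sqrt(v) keeps shrinking.
    v = v + grad ** 2
    w = w - eta * grad / (math.sqrt(v) + eps)
    return w, v

def rmsprop_step(w, grad, v, eta=0.01, beta=0.9, eps=1e-8):
    # RMSprop: exponentially decaying average of squared gradients.
    # Old gradients fade away, so the step size adapts to RECENT
    # gradients instead of decaying toward zero.
    v = beta * v + (1 - beta) * grad ** 2
    w = w - eta * grad / (math.sqrt(v) + eps)
    return w, v

# Toy comparison: a constant gradient of 1.0 for 10,000 steps.
w_a = w_r = 0.0
v_a = v_r = 0.0
for _ in range(10_000):
    w_a, v_a = adagrad_step(w_a, 1.0, v_a)
    w_r, v_r = rmsprop_step(w_r, 1.0, v_r)

print(f"Adagrad accumulator: {v_a:.1f}")   # ~10000 -> step ~100x smaller
print(f"RMSprop accumulator: {v_r:.3f}")   # ~1.0   -> step stays usable
```

In this toy run, Adagrad’s accumulator grows to about 10,000, so its effective step shrinks roughly 100 times, while RMSprop’s settles near 1 and its step size stays usable.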
