In Simple BGD (Batch Gradient Descent) the parameters are updated by some of all the square error of data set. However, BGD convergence much accurately but when there are large data set then the convergence will take lot of time and memory.

Therefore, to overcome these problems, following evolution were made on optimizing parameters.

Stochastic Gradient Descent or SGD, is uses few or single training data set to update the parameters. This is define mathematical as:

for i in m:
θ = θ − α ∂/∂θ J(θ; x(i), y(i))

Learning rate is very difficult to determine for SGD since there is lot of variance these learning rate are kept small. On way is to keep learning rate small enough that give stable convergence in the initial epoch.

The problem with SGD is if the data is spares (which mostly is) then SGD will not reach the local minimum. SGD can be visualized as

This is the variance of the gradient descent, in which it splits the data sets into small batches. Implementation may choose to sum all the gradient descent over the mini-batch or take the average of the gradient which further reduce the variance.

This provide benefit to get not all the training data-set into memory furthermore, It provides computation efficiently.

Momentum

The problem with SGD and variation of gradient descent is that it never reaches the local optima where the surface curve is more deep.

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. This can be visualized as the giant ball rolling down the hill and as the ball is rolling it becomes faster and faster on the way. Similarly momentum increases for dimensions whose gradient point in the same direction and reduces update for dimensions whose gradient direction.

It does this by adding fraction ϒ of the update vector of the past time step to the current step vector update. ϒ is a hyperparameter, whose value are usually set to 0.9

v(t)=γv(t−1)+η∇θJ(θ)
θ=θ−vt

Nesterov found the problem with Momentum that when the Momentum reaches its local optima due to the momentum it pass through the local optima. In order to solve the problem Nesterov computes θγv(t1) which gives the approximate of the next position of the parameters. The rough idea of parameter is going.

v(t)=γv(t−1)+η∇θJ(θγv(t1))
θ=θ−vt

While Nesterov first compute the current gradient and then take a big jump in the direction of the accumulated gradient. NAG compute the previous accumulate gradient. Now NAG measures the gradient and make correction which results in updates. This help prevent us going too fast and thus increase the responsiveness.

It adopts the learning rate to the parameters, perform large update for infrequent parameters and small updates for frequent parameters. Previously, we were using the same learning rate for very update in parameter. As AdaGrad uses different learning rate for every parameters in every time step (t)

It can represent has θ(t+1,i) θ(t,i) − ηg(t,i)θ(t+1,i)

Where g(t,i) is the gradient with respect to parameter θ(i) at time t. In the update rule, AdaGrad updates learning rate at each step time (t) based on the past gradient descent that had been computed for θ(i).

G(t,ii) is the diagonal matrix where each diagonal element i,i is the sum of the squares of the gradient descent with respect to time (t) and parameter θ(i). While  ϵ is the smoothing term to avoid divide by zero.

Main benefit of AdaGrad is that it remove the needs for setting the learning rate. But note that the accumulated sum in the denominator keeps growing as they are positive therefore the learning rate keeps on shrinking and eventually algorithm is not able to learn more.

Fixed windows can be represented as E[g^2](t)=γE[g^2](t−1)+(1−γ)g(t)^2 . Where γ is same as in Momentum, and its value is usually is 0.9. E is the result of previous gradient average with current average. Now we can replace G(t) in the AdaGrad with E[g^2]

Thus becomes,

We compute the decaying averages of past and past squared gradient mt and vt respectively.

m(t) =β(1)m(t1)+(1β1)g(t)
v(t) = β(2)v(t−1) + (1−β2)g^2

The author of Adam observers that these mt and vt are biased towards zero especially in initial time and when the decay rate are small

They used to update parameter as you seen in other gradient descent.

Reference: