In Simple BGD (Batch Gradient Descent) the parameters are updated by some of all the square error of data set. However, BGD convergence much accurately but when there are large data set then the convergence will take lot of time and memory.

Therefore, to overcome these problems, following evolution were made on optimizing parameters.

**Stochastic Gradient Descent **or SGD, is uses few or single training data set to update the parameters. This is define mathematical as:

for i in m:

θ = θ − α ∂/∂θ J(θ; x(i), y(i))

Learning rate is very difficult to determine for SGD since there is lot of variance these learning rate are kept small. On way is to keep learning rate small enough that give stable convergence in the initial epoch.

The problem with SGD is if the data is spares (which mostly is) then SGD will not reach the local minimum. SGD can be visualized as

**Mini Batch Gradient Descent**

This is the variance of the gradient descent, in which it splits the data sets into small batches. Implementation may choose to sum all the gradient descent over the mini-batch or take the average of the gradient which further reduce the variance.

This provide benefit to get not all the training data-set into memory furthermore, It provides computation efficiently.

**Momentum**

The problem with SGD and variation of gradient descent is that it never reaches the local optima where the surface curve is more deep.

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. This can be visualized as the giant ball rolling down the hill and as the ball is rolling it becomes faster and faster on the way. Similarly momentum increases for dimensions whose gradient point in the same direction and reduces update for dimensions whose gradient direction.

It does this by adding fraction ϒ of the update vector of the past time step to the current step vector update. ϒ is a hyperparameter, whose value are usually set to 0.9

v(t)=γv(t−1)+η∇θJ(θ)

θ=θ−vt

**Nesterov Accelerated Gradient**

Nesterov found the problem with Momentum that when the Momentum reaches its local optima due to the momentum it pass through the local optima. In order to solve the problem Nesterov computes θ−γv(t−1) which gives the approximate of the next position of the parameters. The rough idea of parameter is going.

v(t)=γv(t−1)+η∇θJ(θ−γv(t−1))

θ=θ−vt

While Nesterov first compute the current gradient and then take a big jump in the direction of the accumulated gradient. NAG compute the previous accumulate gradient. Now NAG measures the gradient and make correction which results in updates. This help prevent us going too fast and thus increase the responsiveness.

**AdaGrad**

It adopts the learning rate to the parameters, perform large update for infrequent parameters and small updates for frequent parameters. Previously, we were using the same learning rate for very update in parameter. As AdaGrad uses different learning rate for every parameters in every time step (t)

It can represent has θ(t+1,i) = θ(t,i) − η⋅g(t,i)θ(t+1,i)

Where g(t,i) is the gradient with respect to parameter θ(i) at time t. In the update rule, AdaGrad updates learning rate at each step time (t) based on the past gradient descent that had been computed for θ(i).

G(t,ii) is the diagonal matrix where each diagonal element i,i is the sum of the squares of the gradient descent with respect to time (t) and parameter θ(i). While ϵ is the smoothing term to avoid divide by zero.

Main benefit of AdaGrad is that it remove the needs for setting the learning rate. But note that the accumulated sum in the denominator keeps growing as they are positive therefore the learning rate keeps on shrinking and eventually algorithm is not able to learn more.

**AdaDelta**

It is the extension of AdaGrad. AdaDelta solves the problem of decaying learning rate. In AdaDelta instead of accumulating all past Gradient descent, Adadelta restricts the window of accumulated past gradients to some fixed size w. It uses the decaying average of all the past gradient descent. It only depends on the previous average and the current gradient descent.

Fixed windows can be represented as E[g^2](t)=γE[g^2](t−1)+(1−γ)g(t)^2 . Where γ is same as in Momentum, and its value is usually is 0.9. E is the result of previous gradient average with current average. Now we can replace G(t) in the AdaGrad with E[g^2]

Thus becomes,

Same as AdaGrad, we don’t need to add learning rate furthermore, it also fixes the problem of decaying learning rate in AdaGrad.

**Adam**

Adam (adaptive moment estimate) is another method that computes adaptive learning rate for every parameter. We can see Adam using both Momentum and AdaDelta techniques. In Adam, keeping an exponentially decaying average of past gradient mt and storing an exponentially decaying average of past square gradient vt.

We compute the decaying averages of past and past squared gradient mt and vt respectively.

m(t) =β(1)m(t−1)+(1−β1)g(t)

v(t) = β(2)v(t−1) + (1−β2)g^2

The author of Adam observers that these mt and vt are biased towards zero especially in initial time and when the decay rate are small

They used to update parameter as you seen in other gradient descent.

Reference:

http://ruder.io/optimizing-gradient-descent/