Momentum Method and Nesterov Accelerated Gradient

In the previous post, Gradient Descent and Stochastic Gradient Descent Algorithms for Neural Networks, we discussed how the Stochastic Gradient Descent algorithm can improve on the plain Gradient Descent algorithm in terms of complexity and generalization. Recall the update rule of Stochastic Gradient Descent at each iteration t,

θₜ₊₁ = θₜ − η·gₜ        (1)

where θ is the parameter that the algorithm tweaks to achieve an acceptable loss, g is the gradient, which tells us in which direction (we move opposite to it) and how far we should tweak the parameters to minimize the loss, and η is the learning rate, a hyper-parameter that denotes how much we tweak our parameters with respect to the gradient, or how much we trust the gradient.
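
To make the update rule concrete, below is a minimal NumPy sketch of equation (1) on a toy least-squares problem. The data, mini-batch size, and learning rate are placeholder choices for illustration only.

```python
import numpy as np

# Toy least-squares problem: the data, model, and hyper-parameters below are
# illustrative placeholders, not values from the post.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # toy targets

theta = np.zeros(3)   # parameters the algorithm will tweak
eta = 0.05            # learning rate

for t in range(200):
    idx = rng.integers(0, len(X), size=32)           # random mini-batch
    xb, yb = X[idx], y[idx]
    g = 2 * xb.T @ (xb @ theta - yb) / len(idx)      # mini-batch gradient of the squared loss
    theta = theta - eta * g                          # equation (1): θ ← θ − η·g

print(theta)  # should approach true_w
```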

Unfortunately, there are conditions under which even Stochastic Gradient Descent can become very slow, for example when the gradient is consistently small. This is because the update rule depends only on the gradient at each iteration, as shown in equation (1). Noisy gradients can be a problem too, since Stochastic Gradient Descent will then frequently follow the wrong direction.

In neural network training, the Momentum method can be used to mitigate these problems and accelerate learning compared to plain Stochastic Gradient Descent. Introduced in 1964 by Polyak, the Momentum method accelerates gradient descent by taking account of previous gradients in the update rule at each iteration. This can be clearly observed in the update rule applied at every iteration, that is

vₜ₊₁ = α·vₜ − η·gₜ
θₜ₊₁ = θₜ + vₜ₊₁        (2)

where v is the velocity term, the direction and speed at which the parameters should be tweaked, and α is the decay hyper-parameter, which determines how quickly the accumulated previous gradients decay, or how much we trust them. If α is much bigger than η, the accumulated previous gradients dominate the update rule, so the gradient at the current iteration will not change the direction quickly. This is good when the gradient is noisy, because the update tends to keep pointing in the true direction. On the other hand, if α is much smaller than η, the accumulated previous gradients merely act as a smoothing factor for the current gradient.
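
As an illustration, the Momentum update in equation (2) only adds a velocity term to the SGD sketch above; the decay rate and learning rate here are again placeholder choices.

```python
import numpy as np

# Momentum update (equation 2) on the same toy least-squares problem as before;
# all values are illustrative placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
v = np.zeros(3)       # velocity: decayed accumulation of previous gradients
eta = 0.05            # learning rate
alpha = 0.9           # decay hyper-parameter

for t in range(200):
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    g = 2 * xb.T @ (xb @ theta - yb) / len(idx)
    v = alpha * v - eta * g      # v ← α·v − η·g
    theta = theta + v            # θ ← θ + v

print(theta)  # should approach true_w
```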

Another method that is closely related to the Momentum method is Nesterov Accelerated Gradient. The difference between the Momentum method and Nesterov Accelerated Gradient lies in the gradient computation phase. In the Momentum method, the gradient is computed using the current parameters θₜ,

gₜ = ∇L(θₜ)

where L denotes the loss,

whereas in Nesterov Accelerated Gradient, we first apply the velocity vₜ, scaled by the decay rate α, to the parameters θₜ to compute the interim parameters θ̃ₜ. We then compute the gradient using the interim parameters,

θ̃ₜ = θₜ + α·vₜ
gₜ = ∇L(θ̃ₜ)

After we obtain this gradient, we update the parameters using the same update rule as the Momentum method (Eq. 2); the only difference is the gradient being used,

vₜ₊₁ = α·vₜ − η·∇L(θ̃ₜ)
θₜ₊₁ = θₜ + vₜ₊₁
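
A sketch of the Nesterov update on the same toy problem might look as follows; the only change from the Momentum sketch is that the gradient is evaluated at the interim parameters θ + α·v.

```python
import numpy as np

# Nesterov Accelerated Gradient on the same toy least-squares problem;
# all values are illustrative placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
v = np.zeros(3)
eta = 0.05
alpha = 0.9

def minibatch_grad(params):
    # Mini-batch gradient of the squared loss at the given parameters.
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    return 2 * xb.T @ (xb @ params - yb) / len(idx)

for t in range(200):
    theta_interim = theta + alpha * v     # peek where the velocity would take us
    g = minibatch_grad(theta_interim)     # gradient at the interim parameters
    v = alpha * v - eta * g               # same velocity update as Momentum (Eq. 2)
    theta = theta + v

print(theta)  # should approach true_w
```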

We can view Nesterov Accelerated Gradient as a correction factor for the Momentum method. Consider the case where the velocity added to the parameters immediately gives an unwanted high loss, e.g., the exploding gradient case. Here the Momentum method can be very slow, since the optimization path it takes exhibits large oscillations. In the Nesterov Accelerated Gradient case, you can think of the interim parameters as a peek at where the added velocity will lead the parameters. If the velocity update leads to a bad loss, the gradient will direct the update back towards θₜ. This helps Nesterov Accelerated Gradient avoid the oscillations. The geometric visualization can be seen in Fig 1.

The distinction between the Momentum method and Nesterov Accelerated Gradient updates was shown by Sutskever et al. in Theorem 2.1: the two methods differ meaningfully only when the learning rate η is reasonably large. When η is relatively large, Nesterov Accelerated Gradient allows a larger decay rate α than the Momentum method while still preventing oscillations. The theorem also shows that the Momentum method and Nesterov Accelerated Gradient become equivalent when η is small.
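
To get a feel for this behaviour, one can run both updates on a small deterministic quadratic and compare the resulting parameters for a small and a larger learning rate; the quadratic, decay rate, and learning rates below are illustrative choices, not the setting analyzed by Sutskever et al.

```python
import numpy as np

# Toy comparison of Momentum and Nesterov Accelerated Gradient on a fixed,
# ill-conditioned quadratic loss L(θ) = 0.5·θᵀHθ. All values are illustrative.
H = np.diag([1.0, 25.0])
grad = lambda th: H @ th          # exact gradient, no mini-batch noise

def run(eta, alpha, nesterov, steps=50):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        g = grad(theta + alpha * v) if nesterov else grad(theta)
        v = alpha * v - eta * g
        theta = theta + v
    return theta

for eta in (0.001, 0.05):
    m = run(eta, alpha=0.9, nesterov=False)
    n = run(eta, alpha=0.9, nesterov=True)
    print(f"eta={eta}: |momentum - nesterov| = {np.linalg.norm(m - n):.6f}")
```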
