Optimization Algorithms for Neural Networks: From Gradient Descent to Adam

Training neural networks effectively requires choosing the right optimization algorithm. Each optimizer adjusts weights and biases to minimize the loss function by leveraging gradients in different ways. Below we explore classic gradient descent variants, introduce momentum-based methods, and dive into adaptive learning rate strategies.

Batch Gradient Descent

Batch Gradient Descent computes the gradient across the entire dataset at each step. In every iteration, the parameters $\theta$ are updated by subtracting the learning rate $\eta$ multiplied by the full gradient $\nabla L(\theta)$:

$$\theta \leftarrow \theta - \eta \nabla L(\theta),$$

where the gradient itself is averaged over all $m$ training examples:

$$\nabla L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla L_i(\theta).$$

This method yields smooth, accurate convergence because each update reflects the true direction of descent. However, processing the entire dataset on every step becomes prohibitively slow as data size grows.
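As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent applied to an assumed least-squares objective for a linear model; the function name, toy data, and hyperparameters are illustrative choices, not part of any particular library.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, epochs=100):
    """Full-batch gradient descent on the mean squared error of a linear model."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Gradient of (1/m) * sum_i (x_i . theta - y_i)^2, averaged over all m examples.
        grad = (2.0 / m) * X.T @ (X @ theta - y)
        theta -= eta * grad  # theta <- theta - eta * grad L(theta)
    return theta

# Toy usage: recover the weights of a noiseless linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
print(batch_gradient_descent(X, y))  # approaches true_w
```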

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent accelerates training by updating parameters using only one randomly selected example per iteration. The update rule simplifies to:

$$\theta \leftarrow \theta - \eta \nabla L_i(\theta)$$

for a randomly chosen index $i$. By avoiding full-batch computations, SGD offers fast, incremental updates suited for large datasets. Its drawback is noisy convergence: high variance in each step can cause oscillations around the minimum.
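The same toy problem can be optimized one example at a time; this sketch (illustrative names and hyperparameters, same assumed least-squares loss as above) shows the single-example update and the per-epoch shuffling that gives SGD its stochastic character.

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=20, seed=0):
    """Plain SGD: one randomly chosen training example per parameter update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):  # visit the examples in a fresh random order
            grad_i = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient of a single squared error
            theta -= eta * grad_i  # theta <- theta - eta * grad L_i(theta)
    return theta
```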

Mini-Batch Gradient Descent

Mini‑batch Gradient Descent strikes a balance between batch and stochastic approaches. At each step, it uses a small subset $B$ of the data:

$$\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta).$$

Choosing the right batch size provides both training stability and efficiency, making this the most common method in deep learning libraries.
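A sketch of the mini-batch variant on the same assumed least-squares setup; the batch size of 32 is only an illustrative default.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent: average the gradient over a small subset B per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            B = idx[start:start + batch_size]  # indices of the current mini-batch
            grad = (2.0 / len(B)) * X[B].T @ (X[B] @ theta - y[B])
            theta -= eta * grad
    return theta
```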

Momentum and Adaptive Methods

Beyond basic GD variants, momentum and adaptive techniques address challenges like slow progress through flat regions or varying gradient scales.

SGD with Momentum introduces a velocity term $v_t$ that accumulates gradients over time:

$$v_t = \beta v_{t-1} + (1 - \beta)\, \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_t.$$

This dampens oscillations in high‑curvature or saddle regions and speeds up convergence in consistent downhill directions. An illustrative diagram of momentum helping to escape saddle points is shown below.

[Figure: Momentum vs. Saddle Point]
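A single momentum update, written in the same exponential-moving-average form as the equations above; `theta`, `velocity`, and `grad` are assumed to be NumPy arrays of matching shape, and $\beta = 0.9$ is only a common illustrative choice.

```python
def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update in the EMA form used in the text."""
    velocity = beta * velocity + (1.0 - beta) * grad  # v_t = beta * v_{t-1} + (1 - beta) * grad
    theta = theta - eta * velocity                    # theta_{t+1} = theta_t - eta * v_t
    return theta, velocity
```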

Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates. It tracks the first moment (mean) $m_t$ and second moment (uncentered variance) $v_t$ of the gradients:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}$$

By adaptively scaling each parameter’s step size and incorporating momentum, Adam achieves robust performance across sparse, noisy, and non-convex landscapes.
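The update can be written as a single step function; this sketch follows the equations above, where `t` is the 1-based step count needed for the bias correction and the defaults are the commonly cited values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1.0 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```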

Adagrad, Adadelta, and RMSProp: Adaptive Learning Rates

Adaptive algorithms adjust learning rates based on gradient history, improving convergence in various scenarios.

Adagrad accumulates the sum of squared gradients $G_t[i]$ for each parameter $i$:

$$G_t[i] = G_{t-1}[i] + g_t[i]^2, \qquad \theta_{t+1}[i] = \theta_t[i] - \frac{\eta}{\sqrt{G_t[i]} + \epsilon}\; g_t[i].$$

While effective for sparse data, the accumulated sum only grows, so the effective learning rate shrinks monotonically and can eventually stall training.
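A per-parameter Adagrad step mirroring the formula; `G` is the running sum of squared gradients and should start at zero (an illustrative sketch, not a library API).

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    """One Adagrad update: G accumulates the squared gradients of every past step."""
    G = G + grad**2                                  # G_t = G_{t-1} + g_t^2 (elementwise)
    theta = theta - eta * grad / (np.sqrt(G) + eps)  # per-parameter effective rate eta / sqrt(G_t)
    return theta, G
```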

Adadelta overcomes Adagrad’s diminishing rates by maintaining decaying averages of both the squared gradients and the squared parameter updates:

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\; g_t,$$

where $E[\Delta\theta^2]$ is an analogous decaying average of past squared parameter updates.

No manual learning rate is needed, but performance on sparse data can vary.
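An Adadelta step that keeps the two decaying averages from the formula; `Eg2` and `Edx2` hold $E[g^2]$ and $E[\Delta\theta^2]$ respectively, and $\rho = 0.95$, $\epsilon = 10^{-6}$ are illustrative defaults.

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    """One Adadelta update: decaying averages of squared gradients and squared updates."""
    Eg2 = rho * Eg2 + (1.0 - rho) * grad**2                   # E[g^2]_t
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # step scaled by RMS of past updates
    Edx2 = rho * Edx2 + (1.0 - rho) * delta**2                # E[delta_theta^2]_t
    return theta + delta, Eg2, Edx2
```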

RMSProp similarly tracks an exponential moving average of squared gradients:

$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1 - \beta)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{g_t}{\sqrt{E[g^2]_t + \epsilon}}.$$

RMSProp excels with non-stationary objectives common in deep networks but requires careful hyperparameter tuning.
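And the corresponding RMSProp step, a sketch of the formula above with illustrative defaults.

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients."""
    Eg2 = beta * Eg2 + (1.0 - beta) * grad**2          # E[g^2]_t
    theta = theta - eta * grad / (np.sqrt(Eg2) + eps)  # scale each step by 1 / sqrt(E[g^2]_t)
    return theta, Eg2
```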

Comparison Table

| Optimizer | Key Idea | Strengths | Weaknesses | Key Params |
|-----------|----------|-----------|------------|------------|
| Adagrad | Per-parameter rates scaled by the sum of past squared gradients | Works well for sparse features | Rates decay too much over time | $\eta, \epsilon$ |
| Adadelta | Decaying average of past squared gradients and updates | No learning rate to tune | Performance varies on sparse data | $\rho, \epsilon$ |
| RMSProp | Exponential average of squared gradients | Handles non-stationary targets | Sensitive to decay rate | $\eta, \beta, \epsilon$ |
| Adam | Combines momentum and RMSProp-style variance scaling | Robust to noisy gradients | Needs hyperparameter tuning | $\eta, \beta_1, \beta_2, \epsilon$ |