Optimization Algorithms for Neural Networks: From Gradient Descent to Adam

Training neural networks effectively requires choosing the right optimization algorithm. Each optimizer adjusts weights and biases to minimize the loss function by leveraging gradients in different ways. Below we explore classic gradient descent variants, introduce momentum-based methods, and dive into adaptive learning rate strategies.

Batch Gradient Descent

Batch Gradient Descent computes the gradient across the entire dataset at each step. In every iteration, the parameters $\theta$ are updated by subtracting the learning rate $\eta$ multiplied by the full gradient $\nabla L(\theta)$:

$$\theta \leftarrow \theta - \eta \nabla L(\theta),$$

where the gradient itself is averaged over all $m$ training examples:

$$\nabla L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla L_i(\theta).$$

This method yields smooth, accurate convergence because each update reflects the true direction of descent. However, processing the entire dataset on every step becomes prohibitively slow as data size grows.
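As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent applied to an assumed least-squares objective for a linear model; the function name, toy data, and hyperparameters are illustrative choices, not part of any particular library.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, epochs=100):
    """Full-batch gradient descent on the mean squared error of a linear model."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # Gradient of (1/m) * sum_i (x_i . theta - y_i)^2, averaged over all m examples.
        grad = (2.0 / m) * X.T @ (X @ theta - y)
        theta -= eta * grad  # theta <- theta - eta * grad L(theta)
    return theta

# Toy usage: recover the weights of a noiseless linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
print(batch_gradient_descent(X, y))  # approaches true_w
```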

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent accelerates training by updating parameters using only one randomly selected example per iteration. The update rule simplifies to:

$$\theta \leftarrow \theta - \eta \nabla L_i(\theta)$$

for a randomly chosen index $i$. By avoiding full-batch computations, SGD offers fast, incremental updates suited for large datasets. Its drawback is noisy convergence: high variance in each step can cause oscillations around the minimum.
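The same toy problem can be optimized one example at a time; this sketch (illustrative names and hyperparameters, same assumed least-squares loss as above) shows the single-example update and the per-epoch shuffling that gives SGD its stochastic character.

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=20, seed=0):
    """Plain SGD: one randomly chosen training example per parameter update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):  # visit the examples in a fresh random order
            grad_i = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient of a single squared error
            theta -= eta * grad_i  # theta <- theta - eta * grad L_i(theta)
    return theta
```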

Mini-Batch Gradient Descent

Mini‑batch Gradient Descent strikes a balance between batch and stochastic approaches. At each step, it uses a small subset $B$ of the data:

$$\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta).$$

Choosing the right batch size provides both training stability and efficiency, making this the most common method in deep learning libraries.
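A sketch of the mini-batch variant on the same assumed least-squares setup; the batch size of 32 is only an illustrative default.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent: average the gradient over a small subset B per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            B = idx[start:start + batch_size]  # indices of the current mini-batch
            grad = (2.0 / len(B)) * X[B].T @ (X[B] @ theta - y[B])
            theta -= eta * grad
    return theta
```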

Momentum and Adaptive Methods

Beyond basic GD variants, momentum and adaptive techniques address challenges like slow progress through flat regions or varying gradient scales.

SGD with Momentum introduces a velocity term $v_t$ that accumulates gradients over time:

$$v_t = \beta v_{t-1} + (1 - \beta)\, \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_t.$$

This dampens oscillations in high‑curvature or saddle regions and speeds up convergence in consistent downhill directions. An illustrative diagram of momentum helping to escape saddle points is shown below.

[Figure: Momentum vs. Saddle Point]
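A single momentum update, written in the same exponential-moving-average form as the equations above; `theta`, `velocity`, and `grad` are assumed to be NumPy arrays of matching shape, and $\beta = 0.9$ is only a common illustrative choice.

```python
def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update in the EMA form used in the text."""
    velocity = beta * velocity + (1.0 - beta) * grad  # v_t = beta * v_{t-1} + (1 - beta) * grad
    theta = theta - eta * velocity                    # theta_{t+1} = theta_t - eta * v_t
    return theta, velocity
```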

Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates. It tracks the first moment (mean) $m_t$ and second moment (uncentered variance) $v_t$ of the gradients:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}$$

By adaptively scaling each parameter’s step size and incorporating momentum, Adam achieves robust performance across sparse, noisy, and non-convex landscapes.
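The update can be written as a single step function; this sketch follows the equations above, where `t` is the 1-based step count needed for the bias correction and the defaults are the commonly cited values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1.0 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```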

Adagrad, Adadelta, and RMSProp: Adaptive Learning Rates

Adaptive algorithms adjust learning rates based on gradient history, improving convergence in various scenarios.

Adagrad accumulates the sum of squared gradients $G_t[i]$ for each parameter $i$:

$$G_t[i] = G_{t-1}[i] + g_t[i]^2, \qquad \theta_{t+1}[i] = \theta_t[i] - \frac{\eta}{\sqrt{G_t[i]} + \epsilon}\; g_t[i].$$

While effective for sparse data, the accumulated sum only grows, so the effective learning rate shrinks monotonically and can eventually stall training.
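A per-parameter Adagrad step mirroring the formula; `G` is the running sum of squared gradients and should start at zero (an illustrative sketch, not a library API).

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    """One Adagrad update: G accumulates the squared gradients of every past step."""
    G = G + grad**2                                  # G_t = G_{t-1} + g_t^2 (elementwise)
    theta = theta - eta * grad / (np.sqrt(G) + eps)  # per-parameter effective rate eta / sqrt(G_t)
    return theta, G
```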

Adadelta overcomes Adagrad’s diminishing rates by maintaining decaying averages of both the squared gradients and the squared parameter updates:

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\; g_t,$$

where $E[\Delta\theta^2]$ is an analogous decaying average of past squared parameter updates.

No manual learning rate is needed, but performance on sparse data can vary.
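An Adadelta step that keeps the two decaying averages from the formula; `Eg2` and `Edx2` hold $E[g^2]$ and $E[\Delta\theta^2]$ respectively, and $\rho = 0.95$, $\epsilon = 10^{-6}$ are illustrative defaults.

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    """One Adadelta update: decaying averages of squared gradients and squared updates."""
    Eg2 = rho * Eg2 + (1.0 - rho) * grad**2                   # E[g^2]_t
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # step scaled by RMS of past updates
    Edx2 = rho * Edx2 + (1.0 - rho) * delta**2                # E[delta_theta^2]_t
    return theta + delta, Eg2, Edx2
```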

RMSProp similarly tracks an exponential moving average of squared gradients:

$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1 - \beta)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{g_t}{\sqrt{E[g^2]_t + \epsilon}}.$$

RMSProp excels with non-stationary objectives common in deep networks but requires careful hyperparameter tuning.
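And the corresponding RMSProp step, a sketch of the formula above with illustrative defaults.

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients."""
    Eg2 = beta * Eg2 + (1.0 - beta) * grad**2          # E[g^2]_t
    theta = theta - eta * grad / (np.sqrt(Eg2) + eps)  # scale each step by 1 / sqrt(E[g^2]_t)
    return theta, Eg2
```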

Comparison Table

| Optimizer | Key Idea | Strengths | Weaknesses | Key Params |
|-----------|----------|-----------|------------|------------|
| Adagrad | Per-parameter rates scaled by the sum of past squared gradients | Works well for sparse features | Rates decay too much over time | $\eta, \epsilon$ |
| Adadelta | Decaying average of past squared gradients and updates | No learning rate to tune | Performance varies on sparse data | $\rho, \epsilon$ |
| RMSProp | Exponential average of squared gradients | Handles non-stationary targets | Sensitive to decay rate | $\eta, \beta, \epsilon$ |
| Adam | Combines momentum and RMSProp-style variance scaling | Robust to noisy gradients | Needs hyperparameter tuning | $\eta, \beta_1, \beta_2, \epsilon$ |