Optimization Algorithms for Neural Networks: From Gradient Descent to Adam
Training neural networks effectively requires choosing the right optimization algorithm. Each optimizer adjusts weights and biases to minimize the loss function by leveraging gradients in different ways. Below we explore the classic gradient descent variants, introduce momentum-based methods, and dive into adaptive learning rate strategies.
Batch Gradient Descent
Batch Gradient Descent computes the gradient across the entire dataset at each step. In every iteration, the parameters $\theta$ are updated by subtracting the learning rate $\eta$ times the full gradient:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where the gradient itself is averaged over all $N$ training examples:

$$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \, \ell(\theta; x_i, y_i)$$
This method yields smooth, accurate convergence because each update reflects the true direction of descent. However, processing the entire dataset on every step becomes prohibitively slow as data size grows.
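To make the update concrete, here is a minimal NumPy sketch of a full-batch step, assuming a linear model trained with mean-squared error; the function names (`mse_grad`, `batch_gd_step`) and the toy data are purely illustrative.

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of mean-squared error for a linear model, averaged over all rows."""
    return 2.0 / len(X) * X.T @ (X @ theta - y)

def batch_gd_step(theta, X, y, lr=0.1):
    """One full-batch gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * mse_grad(theta, X, y)

# Toy usage: fit y = 3x with a few full-batch steps.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
theta = np.zeros(1)
for _ in range(200):
    theta = batch_gd_step(theta, X, y)
```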
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent accelerates training by updating the parameters using only one randomly selected example per iteration. The update rule simplifies to:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \, \ell(\theta; x_i, y_i)$$

for a randomly chosen index $i$. By avoiding full-batch computations, SGD offers fast, incremental updates suited to large datasets. Its drawback is noisy convergence: the high variance of each step can cause oscillations around the minimum.
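A corresponding sketch for SGD, again assuming a linear/MSE model, picks one example at random per update; `sgd_step` is an illustrative name, not a library function.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, lr=0.01):
    """One SGD update computed from a single (x_i, y_i) example of a linear/MSE model."""
    grad = 2.0 * x_i * (x_i @ theta - y_i)   # per-example gradient
    return theta - lr * grad

rng = np.random.default_rng(0)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
theta = np.zeros(1)
for _ in range(1000):
    i = rng.integers(len(X))                 # pick one example at random
    theta = sgd_step(theta, X[i], y[i])
```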
Mini-Batch Gradient Descent
Mini-batch Gradient Descent strikes a balance between the batch and stochastic approaches. At each step, it uses a small subset of the data, a mini-batch $B$:

$$\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \, \ell(\theta; x_i, y_i)$$
Choosing the right batch size provides both training stability and efficiency, making this the most common method in deep learning libraries.
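A minimal sketch of a mini-batch step under the same linear/MSE assumptions; the batch size of 32 and the helper name `minibatch_gd_step` are arbitrary choices for illustration.

```python
import numpy as np

def minibatch_gd_step(theta, X_batch, y_batch, lr=0.05):
    """One update from a mini-batch of a linear/MSE model."""
    grad = 2.0 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)
theta = np.zeros(1)
for _ in range(500):
    idx = rng.choice(len(X), size=32, replace=False)   # sample a mini-batch
    theta = minibatch_gd_step(theta, X[idx], y[idx])
```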
Momentum and Adaptive Methods
Beyond the basic gradient descent variants, momentum and adaptive techniques address challenges such as slow progress through flat regions and widely varying gradient scales.
SGD with Momentum introduces a velocity term that accumulates a decaying sum of past gradients:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v_t$$

where the momentum coefficient $\gamma$ (typically around 0.9) controls how much past velocity is retained. This dampens oscillations in high-curvature or saddle regions and speeds up convergence along consistent downhill directions.
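The velocity update can be sketched in a few lines of NumPy; the ill-conditioned quadratic below is only a toy objective chosen to show momentum making steady progress despite very different curvatures.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update."""
    velocity = gamma * velocity + lr * grad   # accumulate decayed past gradients
    return theta - velocity, velocity         # move along the accumulated velocity

# Toy usage: minimize f(theta) = 0.5 * theta^T A theta with an ill-conditioned A.
A = np.diag([1.0, 50.0])
theta = np.array([1.0, 1.0])
velocity = np.zeros(2)
for _ in range(300):
    grad = A @ theta                          # gradient of the quadratic
    theta, velocity = momentum_step(theta, velocity, grad)
```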
Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates. It tracks exponential moving averages of the first moment (mean) and second moment (uncentered variance) of the gradients $g_t$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

These estimates are bias-corrected and used to scale each parameter's step:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta \leftarrow \theta - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

By adaptively scaling each parameter's step size and incorporating momentum, Adam achieves robust performance across sparse, noisy, and non-convex landscapes.
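A compact sketch of a single Adam step using the default hyperparameters from the original paper; the function signature and state-passing convention are illustrative, not any particular library's API.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running moment estimates, t is the step count (>= 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction for m
    v_hat = v / (1 - beta2 ** t)                  # bias correction for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```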
Adagrad, Adadelta, and RMSProp: Adaptive Learning Rates
Adaptive algorithms adjust learning rates based on gradient history, improving convergence when gradient magnitudes differ widely across parameters or over time.
Adagrad accumulates the sum of squared gradients for each parameter and scales the step element-wise:

$$G_t = G_{t-1} + g_t^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t$$

While effective for sparse data, the accumulated sum only grows, so the effective learning rates decay continually and may slow training excessively.
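A sketch of one Adagrad step; `G` holds the per-parameter accumulated squared gradients, and the defaults shown are common illustrative choices.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """One Adagrad update; G accumulates squared gradients per parameter."""
    G = G + grad ** 2                              # ever-growing sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)   # per-parameter scaled step
    return theta, G
```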
Adadelta overcomes Adagrad's diminishing rates by maintaining decaying averages of both squared gradients and squared parameter updates:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \qquad E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho) \Delta\theta_t^2$$

$$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t, \qquad \theta \leftarrow \theta + \Delta\theta_t$$

No manual learning rate is needed, since the ratio of the two averages sets the step size, but performance on sparse data can vary.
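A sketch of one Adadelta step under the update rules above; `eg2` and `edx2` are the two decaying averages, and the defaults follow commonly used illustrative values.

```python
import numpy as np

def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """One Adadelta update; eg2 / edx2 are decaying averages of g^2 and dx^2."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                 # average of squared gradients
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad   # step sized by ratio of averages
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                 # average of squared updates
    return theta + dx, eg2, edx2
```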
RMSProp similarly tracks an exponential moving average of squared gradients and divides each step by its root:

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

RMSProp excels with the non-stationary objectives common in deep networks, but requires careful tuning of the learning rate and decay factor.
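A sketch of one RMSProp step; `eg2` is the decaying average of squared gradients, and the hyperparameter defaults shown are illustrative.

```python
import numpy as np

def rmsprop_step(theta, grad, eg2, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update; eg2 is a decaying average of squared gradients."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2          # exponential moving average of g^2
    theta = theta - lr * grad / np.sqrt(eg2 + eps)   # normalize the step per parameter
    return theta, eg2
```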
Comparison Table
| Optimizer | Key Idea | Strengths | Weaknesses | Key Params |
|---|---|---|---|---|
| Adagrad | Per-parameter rates scaled by the sum of past squared gradients | Works well for sparse features | Rates decay too much over time | $\eta$, $\epsilon$ |
| Adadelta | Decaying averages of past squared gradients and updates | No learning rate to tune | Varies on sparse data | $\rho$, $\epsilon$ |
| RMSProp | Exponential average of squared gradients | Handles non-stationary targets | Sensitive to decay rate | $\eta$, $\rho$, $\epsilon$ |
| Adam | Combines momentum and RMSProp-style variance scaling | Robust to noisy gradients | Needs hyperparameter tuning | $\eta$, $\beta_1$, $\beta_2$, $\epsilon$ |