Understanding Long Short-Term Memory (LSTM) Networks

Long Short-Term Memory (LSTM) networks are a specialized type of recurrent neural network (RNN) designed to address the vanishing gradient problem and capture long-range dependencies in sequential data. At their core, LSTMs manage information flow through a system of gates and memory states, enabling precise control over what is retained, updated, and output over time. This article provides a comprehensive breakdown of LSTM mechanics, focusing on their gates, hidden and cell states, mathematical foundations, and practical applications.

[Figure: LSTM diagram]

Core Components of an LSTM

An LSTM processes sequential data one timestep at a time, using three critical gates (the forget gate, input gate, and output gate) to regulate the flow of information. These gates work in tandem with two internal states: the hidden state and the cell state.

1. The Forget Gate: Filtering Irrelevant Information

The forget gate determines which parts of the long-term memory (cell state) should be discarded or retained. It takes the previous hidden state h_{t-1} and the current input x_t, concatenates them, and applies a sigmoid activation to produce values between 0 and 1. These values act as filters:

  • Values close to 1 indicate information to keep.
  • Values close to 0 indicate information to discard.

For example, in a sentence like "The cat, which was hungry, sat on the mat," the forget gate might retain "cat" and "hungry" while discarding less relevant details as the sentence progresses. Mathematically, this is expressed as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Here, f_t is the forget gate's output, W_f and b_f are learnable weights and biases, and \sigma is the sigmoid function. The previous cell state C_{t-1} is then multiplied element-wise by f_t, selectively erasing outdated information.
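
To make this concrete, here is a minimal NumPy sketch of the forget gate acting on a toy cell state. The sizes, the random weights W_f and b_f, and the helper names are illustrative assumptions, not values from a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 2

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # learnable weights (random here)
b_f = np.zeros(hidden_size)                                         # learnable bias

h_prev = rng.standard_normal(hidden_size)   # previous hidden state h_{t-1}
x_t    = rng.standard_normal(input_size)    # current input x_t
C_prev = rng.standard_normal(hidden_size)   # previous cell state C_{t-1}

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)  # gate values in (0, 1)
C_filtered = f_t * C_prev                                 # element-wise "erase" of old memory
print(f_t, C_filtered)
```

Entries of f_t near 0 wipe the corresponding slot of the cell state; entries near 1 pass it through unchanged.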

2. The Input Gate: Updating Memory with New Information

The input gate decides what new information to add to the cell state. It has two components:

  1. A sigmoid layer that identifies which values to update.
  2. A tanh layer that generates candidate values \tilde{C}_t for addition to the cell state.

The updated cell state C_t combines the filtered past memory (from the forget gate) and the new candidate values:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

Here, i_t is the output of the input gate's sigmoid layer, and \odot denotes element-wise multiplication. This step ensures the cell state evolves by integrating relevant new context while preserving essential long-term information.
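
To see the update at work, the toy sketch below uses hand-picked gate values rather than learned ones; the numbers are chosen only to show how the element-wise products keep, erase, or blend individual entries.

```python
import numpy as np

C_prev = np.array([ 2.0, -1.0,  0.5])   # previous cell state C_{t-1}
f_t    = np.array([ 1.0,  0.0,  0.5])   # forget gate: keep, erase, half-keep
i_t    = np.array([ 0.0,  1.0,  0.5])   # input gate: ignore, accept, half-accept
C_cand = np.array([ 0.3,  0.8, -0.4])   # candidate values from the tanh layer

C_t = f_t * C_prev + i_t * C_cand       # element-wise products, then sum
print(C_t)                              # -> [2.   0.8  0.05]
```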

3. The Output Gate: Generating the Hidden State

The output gate controls what information from the cell state is exposed as the hidden state hth_t. It uses a sigmoid layer to decide which parts of the cell state to highlight, then multiplies this by a tanh-scaled version of the cell state:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)

The hidden state serves as a filtered summary of the cell state, containing only the information relevant to the current timestep. For instance, in a translation task, h_t might focus on the subject of a sentence while downplaying less critical details.
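
A small sketch of this final step, again with hand-picked activations standing in for the learned output gate:

```python
import numpy as np

C_t = np.array([2.0, 0.8, 0.05])   # updated cell state
o_t = np.array([0.9, 0.1, 0.5])    # output-gate activations (hand-picked for illustration)

h_t = o_t * np.tanh(C_t)           # expose only the gated, tanh-squashed part of the memory
print(h_t)                         # approximately [0.868 0.066 0.025]
```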

Hidden State vs. Cell State: Roles and Interactions

Hidden State h_t: The Short-Term Messenger

The hidden state acts as the LSTM's interface with the external world. At each timestep, it is passed to the next layer or used for predictions, encapsulating the immediate context needed for the task. Think of h_t as a snapshot of the cell state after being refined by the output gate: a concise representation of what matters "right now."

Cell State C_t: The Long-Term Memory Bank

The cell state functions as the LSTM's persistent memory, carrying information across many timesteps. It is updated sequentially but never directly exposed, allowing it to maintain a coherent narrative of the sequence. For example, in a story generation task, C_t might track overarching plot points, while h_t focuses on the current sentence.

Mathematical Foundations

LSTMs rely on the interplay of sigmoid and tanh activations to balance information retention and flow:

  • Sigmoid \sigma: Squashes values to [0, 1], ideal for gating (e.g., deciding what to forget).
  • Tanh: Squashes values to [-1, 1], normalizing candidate values for stable training. A quick numeric check of both ranges follows below.
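
A trivial sketch, assuming NumPy, confirming the two ranges:

```python
import numpy as np

z = np.array([-5.0, 0.0, 5.0])
print(1.0 / (1.0 + np.exp(-z)))   # sigmoid: ~[0.007, 0.5, 0.993], squashed to (0, 1)
print(np.tanh(z))                 # tanh:    ~[-1.0, 0.0, 1.0], squashed to (-1, 1)
```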


The equations governing an LSTM cell are as follows (a worked NumPy sketch after the list strings them together):

  1. Forget Gate:
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
  2. Input Gate and Candidate Memory:
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
  3. Cell State Update:
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  4. Output Gate and Hidden State:
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)
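
The sketch below strings these four equations into a single-timestep function in NumPy. The parameter initialization, the dimensions, and the lstm_step name are illustrative assumptions; a real model would learn the weights by backpropagation through time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep: returns (h_t, C_t)."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # 1. forget gate
    i_t = sigmoid(W_i @ z + b_i)             # 2. input gate ...
    C_cand = np.tanh(W_C @ z + b_C)          #    ... and candidate memory
    C_t = f_t * C_prev + i_t * C_cand        # 3. cell state update
    o_t = sigmoid(W_o @ z + b_o)             # 4. output gate ...
    h_t = o_t * np.tanh(C_t)                 #    ... and hidden state
    return h_t, C_t

# Hypothetical sizes and small random (untrained) parameters.
rng = np.random.default_rng(42)
input_size, hidden_size = 4, 3

def dense_params():
    W = rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
    b = np.zeros(hidden_size)
    return W, b

W_f, b_f = dense_params()
W_i, b_i = dense_params()
W_C, b_C = dense_params()
W_o, b_o = dense_params()
params = (W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o)

# Run the cell over a short random sequence, carrying h and C forward.
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)   # -> (3,) (3,)
```

Note that many implementations keep separate weight matrices for h_{t-1} and x_t rather than one matrix over their concatenation; the two formulations are mathematically equivalent.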

Why the Hidden State is Output, Not the Cell State

The cell state C_t contains raw, unfiltered memory, which is often too voluminous or irrelevant for immediate tasks. Exposing it directly could overwhelm downstream layers with unnecessary details. Instead, the hidden state h_t provides a distilled version of C_t, emphasizing context critical to the current timestep. This design mirrors human cognition: while our brains store vast amounts of information, we consciously focus only on what's immediately relevant.

Applications of LSTMs

LSTMs excel in tasks requiring context retention over long sequences (a minimal usage sketch follows the list):

  1. Time Series Forecasting: Predicting stock prices or weather patterns by modeling temporal dependencies.
  2. Natural Language Processing (NLP): Machine translation, sentiment analysis, and text generation.
  3. Speech Recognition: Converting audio signals into text by processing phoneme sequences.
  4. Healthcare: Analyzing patient vitals over time for early disease detection.
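
As a practical starting point, here is a minimal PyTorch sketch of an LSTM-based text classifier, in the spirit of the NLP applications above. The model structure, layer sizes, and the SentimentLSTM name are illustrative assumptions rather than a prescribed recipe.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_size=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len) integer tensor
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)      # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])               # classify from the final hidden state

model = SentimentLSTM()
dummy = torch.randint(0, 10_000, (8, 20))       # batch of 8 sequences, 20 tokens each
print(model(dummy).shape)                       # -> torch.Size([8, 2])
```

Here the final hidden state h_n plays the role of h_t at the last timestep: a distilled summary of the whole sequence that the linear head turns into class scores.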

LSTMs solve the limitations of traditional RNNs through a gated architecture that balances short-term relevance and long-term memory. The forget gate filters outdated information, the input gate integrates new data, and the output gate generates context-aware hidden states. Together, the hidden state h_t and cell state C_t enable LSTMs to handle complex sequential tasks, from language modeling to predictive analytics. By understanding these mechanics, practitioners can better leverage LSTMs in applications demanding robust temporal or contextual understanding.