What is a CNN?
Before we get to different types of CNN architecture, let's quickly recall what a CNN is, what a CNN model looks like, and what the most fundamental components of a CNN architecture are.
Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural network architecture that is designed to process data with a grid-like topology. This makes them particularly well-suited for dealing with spatial and temporal data, like images and videos, that maintain a high degree of correlation between adjacent elements.
CNNs are similar to other neural networks, but they add a layer of complexity by using a series of convolutional layers. Convolutional layers perform a mathematical operation called convolution, a sort of specialized matrix multiplication, on the input data. The convolution operation helps preserve the spatial relationship between pixels by learning image features over small squares of input data. The picture below represents a typical CNN architecture.
Typical CNN architecture
The following are definitions of the different layers shown in the above architecture:
Convolutional layers
Operate by sliding a set of "filters" or "kernels" across the input data. Each filter is designed to detect a specific feature or pattern, such as edges, corners, or more complex shapes in the case of deeper layers. As these filters move across the image, they generate a map that signifies the areas where those features were found. The output of the convolutional layer is a feature map: a representation of the input image with the filters applied. Convolutional layers can be stacked to create more complex models, which can learn more intricate features from images. Simply speaking, convolutional layers are responsible for extracting features from the input images. These features might include edges, corners, textures, or more complex patterns.
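To make the sliding-window idea concrete, here is a minimal NumPy sketch of a single filter convolving over a single-channel image; the 5×5 toy image and the Sobel-style kernel are illustrative choices, not part of the architecture above, and (like most deep learning frameworks) it actually computes cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL frameworks)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the kernel with the image patch under it
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])                # classic vertical-edge kernel
print(conv2d(image, sobel_x).shape)                # (3, 3) feature map
```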
Pooling layers
Follow the convolutional layers and are used to reduce the spatial dimensions of the input, making it easier to process and requiring less memory. In the context of images, "spatial dimensions" refer to the width and height of the image. An image is made up of pixels, and you can think of it like a grid, with rows and columns of tiny squares (pixels). By reducing the spatial dimensions, pooling layers help reduce the number of parameters or weights in the network. This helps combat overfitting and lets the model train faster. Max pooling reduces computational complexity by shrinking the feature map, and it makes the model invariant to small translations. Without max pooling, the network would not gain the ability to recognize features irrespective of small shifts or rotations, making the model less robust to variations in object positioning within the image and possibly hurting accuracy.
There are two main types of pooling: max pooling and average pooling. Max pooling takes the maximum value from each pooling window. For example, if the pooling window size is 2×2, it will pick the pixel with the highest value in that 2×2 region. Max pooling effectively captures the most prominent feature within the pooling window. Average pooling calculates the average of all values within the pooling window, providing a smooth, averaged feature representation.
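To see the difference between the two pooling types, the following NumPy sketch (the 4×4 input values are made up for illustration) applies a non-overlapping 2×2 window both ways:

```python
import numpy as np

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 1.],
              [3., 4., 6., 8.]])

def pool2d(a, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = a.shape
    # Carve the array into (size x size) blocks, then reduce each block
    blocks = a.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

print(pool2d(x, mode="max"))   # [[6. 4.] [7. 9.]]  -- most prominent value per window
print(pool2d(x, mode="avg"))   # [[3.75 2.25] [4.   6.  ]]  -- smoothed average per window
```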
Fully-connected layers
One of the most basic types of layers in a convolutional neural network (CNN). As the name suggests, each neuron in a fully connected layer is connected to every neuron in the previous layer. Fully connected layers are typically used towards the end of a CNN, when the goal is to take the features learned by the convolutional and max pooling layers and use them to make predictions, such as classifying the input with a label. For example, if we were using a CNN to classify images of animals, the final fully connected layer might take the features learned by the previous layers and use them to classify an image as containing a dog, cat, bird, etc.
Fully connected layers take the high-dimensional output from the previous convolutional and pooling layers and flatten it into a one-dimensional vector. This allows the network to combine and integrate all the extracted features across the entire image, rather than considering localized features. It helps in understanding the global context of the image. The fully connected layers are responsible for mapping the integrated features to the desired output, such as class labels in classification tasks. They act as the final decision-making part of the network, determining what the extracted features mean in the context of the specific problem (e.g., recognizing a cat or a dog).
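As a minimal Keras sketch of this flatten-then-decide step (the 13×13×32 input shape is a hypothetical feature-map size, chosen only for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical feature-map shape coming out of a conv/pool stack
head = models.Sequential([
    layers.Input(shape=(13, 13, 32)),
    layers.Flatten(),                      # (13, 13, 32) -> 5408-element vector
    layers.Dense(64, activation='relu'),   # combine features across the whole image
    layers.Dense(10, activation='softmax') # map to 10 class probabilities
])
head.summary()  # Flatten output: (None, 5408)
```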
The combination of a convolutional layer followed by a max pooling layer, repeated across similar sets, creates a hierarchy of features. The first layer detects simple patterns, and subsequent layers build on those to detect more complex patterns.
The output layer
In a Convolutional Neural Network (CNN), the output layer plays a critical role: it is the final layer that produces the actual output of the network, typically in the form of a classification or regression result. Its importance can be outlined as follows:
Transformation of Features to Final Output:
The earlier layers of the CNN (convolutional, pooling, and fully connected layers) are responsible for extracting and transforming features from the input data. The output layer takes these high-level, abstracted features and transforms them into a final output form, which is directly interpretable in the context of the problem being solved.
Task-Specific Formulation:
For classification tasks, the output layer typically uses a softmax activation function, which converts the input from the previous layers into a probability distribution over the predefined classes. The softmax function ensures that the output probabilities sum to 1, making them directly interpretable as class probabilities.
For regression tasks, the output layer might consist of one or more neurons with linear or no activation function, providing continuous output values.
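As a tiny NumPy illustration of what softmax does to raw scores (the logit values here are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged mathematically
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores from the last dense layer
probs = softmax(logits)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- directly interpretable as class probabilities
```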
Real-world usage of CNN
CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to identify objects in an image or to classify an image as being a cat or a dog. CNNs can also be used for more complex tasks, such as generating descriptions of an image or identifying the points of interest in an image. Beyond image data, CNNs can also handle time-series data, such as audio data or even text data, although other types of networks like Recurrent Neural Networks (RNNs) or transformers are often preferred for these scenarios. CNNs are a powerful tool for deep learning, and they have been used to achieve state-of-the-art results in many different applications.
A Dive into Function & Code
How CNNs Work
- Input Layer: The raw image data is passed as input. For example, a color image has dimensions (height, width, channels), e.g., (224, 224, 3).
- Convolutional Layer: Applies convolution operations to extract features like edges, corners, or textures.
  - Mathematical operation: $(I * K)(i, j) = \sum_{m}\sum_{n} I(i+m, j+n)\, K(m, n)$
- Activation Function (ReLU): Introduces non-linearity by applying $f(x) = \max(0, x)$.
- Pooling Layer: Reduces the spatial dimensions (height and width) while retaining important features. Common methods:
  - Max Pooling: Takes the maximum value in a window.
  - Average Pooling: Takes the average value in a window.
- Fully Connected Layer (Dense Layer): Connects all neurons to make predictions.
- Softmax/Output Layer: Outputs probabilities for classification.
Flow of a CNN
- Input Image
→ Convolution (Extract Features)
→ ReLU (Non-Linearity)
→ Pooling (Downsample)
→ Flatten (Convert to 1D)
→ Fully Connected Layers
→ Output
Python Example: CNN with TensorFlow/Keras
Implementing a CNN for image classification using the MNIST dataset.
```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocess data
x_train = x_train.reshape((x_train.shape[0], 28, 28, 1)) / 255.0  # Normalize and add channel dimension
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1)) / 255.0
y_train = to_categorical(y_train)  # Convert labels to one-hot encoding
y_test = to_categorical(y_test)

# Build CNN model
model = models.Sequential()

# 1. Convolutional Layer
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Parameters explained:
# - 32: Number of filters
# - (3, 3): Size of the filter/kernel
# - activation='relu': Non-linear activation
# - input_shape: Shape of input data

# 2. Pooling Layer
model.add(layers.MaxPooling2D((2, 2)))
# Parameters explained:
# - (2, 2): Pooling window size (reduces dimensions by half)

# 3. Another Convolutional Layer
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# 4. Another Pooling Layer
model.add(layers.MaxPooling2D((2, 2)))

# 5. Flatten the output to feed into Dense layers
model.add(layers.Flatten())

# 6. Fully Connected Layer
model.add(layers.Dense(64, activation='relu'))

# 7. Output Layer (10 classes for digits 0-9)
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Optimizer: Adam for gradient optimization
# Loss: Categorical cross-entropy for multi-class classification
# Metrics: Accuracy for evaluation

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc}")
```
Key Concepts
- Convolutional Layer:
  - Detects patterns such as edges, textures, or shapes using small filters.
  - Filters are learned during training.
- ReLU Activation:
  - Ensures non-linearity, enabling the model to learn complex features.
- Pooling Layer:
  - Reduces the spatial size, improving computation efficiency and reducing overfitting.
- Flatten Layer:
  - Converts the feature map into a 1D vector to pass into fully connected layers.
- Dense Layer:
  - Makes predictions based on the extracted features.
Visualization of CNN Architecture
- Input Image (28×28×1)
→ Conv2D (32 filters, 3×3) → ReLU
→ MaxPooling (2×2)
→ Conv2D (64 filters, 3×3) → ReLU
→ MaxPooling (2×2)
→ Flatten
→ Dense (64 units) → ReLU
→ Dense (10 units) → Softmax
Practical Notes
- Filter Size: Commonly 3×3 or 5×5 for feature extraction.
- Pooling Size: Typically 2×2 for downsampling.
- ReLU: Avoids the vanishing gradient problem compared to sigmoid/tanh.
- Batch Size and Epochs: Control training speed and model convergence.
Summary Table
| Layer | Purpose | Parameters |
|---|---|---|
| Convolutional | Feature extraction | Filters, kernel size, stride, padding |
| ReLU | Non-linearity | None |
| Pooling | Downsampling | Pooling type (Max/Average), window size |
| Flatten | Convert to 1D | None |
| Dense | Fully connected for prediction | Number of neurons, activation function |
| Softmax | Output probabilities for classification | None |
How does Conv2D with multiple filters work in one convolutional layer?
What Happens in Conv2D?
- When you specify `Conv2D(32, (3, 3))`, it means:
  - There are 32 filters (kernels) in this layer.
  - Each filter has a size of 3×3.
  - These filters are applied to the input simultaneously, not one after the other.
Step-by-Step Explanation
- Input Dimensions: Let's say the input to the Conv2D layer is an image with shape (28, 28, 1) (height, width, channels). Here, `1` represents a grayscale image with 1 channel.
- Filter Application:
  - All 32 filters (each 3×3 in size) are applied to the input image at the same time.
  - Each filter slides over the input (using strides) and performs the convolution operation (dot product) at every location. The result of applying one filter is called a feature map.
- Output (Feature Maps):
  - After applying 32 filters, you get 32 feature maps, one for each filter.
  - If no padding is used, the dimensions of each feature map will be smaller than the input, calculated as: output size = input size − filter size + 1. For our example: $28 - 3 + 1 = 26$, so each feature map is 26×26.
- ReLU Activation:
  - Once all 32 feature maps are generated, the ReLU activation function is applied to each of them. This operation introduces non-linearity by replacing all negative values with `0`.
- Pooling:
  - After ReLU, the feature maps are passed to the pooling layer. Pooling reduces the spatial dimensions (e.g., from 26×26 to 13×13) while keeping the number of feature maps (32) the same.
Key Points to Remember
- Filters are applied simultaneously, not sequentially.
- Each filter extracts a specific feature (e.g., edges, corners, textures) from the input.
- The depth of the output (number of feature maps) equals the number of filters.
- ReLU is applied after convolution to introduce non-linearity.
- Pooling reduces spatial dimensions but does not change the depth.
Example for Better Visualization
Let's take an input image with dimensions 28×28×1:
- Conv2D Layer: `Conv2D(32, (3, 3))`
  - 32 filters (3×3) are applied to the input, producing 32 feature maps of size 26×26.
  - ReLU activation is applied to these feature maps.
- Pooling Layer: `MaxPooling2D((2, 2))`
  - Max pooling reduces the spatial size to 13×13, giving a 13×13×32 output (verified in the sketch below).
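These shapes are easy to check with a couple of Keras layers; a minimal sketch, where `model.summary()` prints the per-layer output shapes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),  # -> (26, 26, 32)
    layers.MaxPooling2D((2, 2)),                   # -> (13, 13, 32)
])
model.summary()
```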
Why This Design?
- Parallel filters enable CNNs to learn multiple features (e.g., horizontal edges, vertical edges, textures) at the same time.
- This helps the network learn hierarchical features:
- Lower layers detect basic patterns (e.g., edges).
- Higher layers detect complex patterns (e.g., shapes or objects).
Let's dive deeper into strides and padding, and explain what it means when filters are applied at the same time in Conv2D.
What Happens When Filters Are Applied "At the Same Time"?
When we say all filters are applied at the same time, we mean:
- Each filter operates on the same input region of the image independently.
- Filters do not conflict with each other, because each filter processes the input in parallel. There's no overlap or interference between filters.
- For each filter, the result of sliding it over the entire image (convolution operation) produces one feature map.
Think of it like running 32 workers (one per filter) to process the same image simultaneously.
Strides: What Are They?
The stride determines how much the filter "jumps" or "shifts" when it slides across the image.
- Default Stride (1): The filter moves one pixel at a time (both horizontally and vertically). This ensures maximum overlap between adjacent receptive fields (regions the filter covers).
- Larger Stride (>1): The filter skips pixels as it slides. For example, with a stride of 2, the filter moves 2 pixels at a time. This results in:
  - Fewer computations.
  - A smaller output feature map (reduced spatial dimensions).
Stride Formula:
With input size $W$, filter size $F$, and stride $S$ (no padding), the output size is: $O = \left\lfloor \frac{W - F}{S} \right\rfloor + 1$
For example:
- Input: 28×28
- Filter: 3×3
- Stride: 2
- Output: $\lfloor (28 - 3)/2 \rfloor + 1 = 13$, i.e., a 13×13 feature map.
Padding: What Is It?
Padding refers to adding extra pixels around the edges of the input image. This is used to control the size of the output feature map.
- Valid Padding (No Padding):
  - No extra pixels are added.
  - The filter only slides over the valid region of the input.
  - Results in a smaller feature map.
  - Example: a 28×28 image with 3×3 filters → a 26×26 feature map.
- Same Padding (Zero Padding):
  - Adds zero-pixels around the image edges to ensure the output has the same spatial dimensions as the input.
  - Formula for padding width (for stride 1 and odd filter size): $P = \frac{F - 1}{2}$
  - Example: a 28×28 image with 3×3 filters → a 28×28 feature map.
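A quick way to sanity-check these padding (and stride) rules is to pass a dummy tensor through Keras Conv2D layers; a minimal sketch, with the expected shapes in the comments:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 1))  # one dummy grayscale image

valid = layers.Conv2D(32, (3, 3), padding='valid')(x)
same = layers.Conv2D(32, (3, 3), padding='same')(x)
strided = layers.Conv2D(32, (3, 3), strides=2, padding='valid')(x)

print(valid.shape)    # (1, 26, 26, 32): 28 - 3 + 1 = 26
print(same.shape)     # (1, 28, 28, 32): zero padding preserves 28x28
print(strided.shape)  # (1, 13, 13, 32): floor((28 - 3) / 2) + 1 = 13
```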
How Do Strides and Padding Apply in Our Case?
```python
Conv2D(32, (3, 3), strides=1, padding='valid')
```
- Strides = 1:
  - The filter moves one pixel at a time (maximum overlap between regions).
  - The output feature map is 26×26 (calculated using the formula without padding).
- Padding = 'valid':
  - No padding is added, so the edges of the image are not padded before the convolution.
If we change to:
```python
Conv2D(32, (3, 3), strides=2, padding='same')
```
- Strides = 2:
  - The filter moves 2 pixels at a time, reducing spatial dimensions (fewer overlaps).
  - The output feature map becomes 14×14 (with 'same' padding, $\lceil 28/2 \rceil = 14$).
- Padding = 'same':
  - Zero-padding is added to maintain the original spatial dimensions where possible.
Visualizing the Process
Let's break down how filters work with strides and padding:
Case 1: Strides = 1, Padding = 'valid'
- The first filter processes the entire image and produces its own feature map.
- The second filter does the same, producing another feature map.
- This happens for all 32 filters, resulting in a 26×26×32 output.
Case 2: Strides = 2, Padding = 'same'
- The first filter skips every other pixel, producing a smaller feature map.
- The second filter does the same, producing another feature map.
- Again, this happens for all 32 filters, resulting in a 14×14×32 output.
Python Code Example
Here's a commented Python example using TensorFlow/Keras:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

# Create a simple Sequential model
model = Sequential()

# Add a Conv2D layer
# 32 filters, 3x3 kernel, stride of 1, 'same' padding
model.add(Conv2D(32, (3, 3), strides=1, padding='same', input_shape=(28, 28, 1)))

# Add a MaxPooling layer
# Pooling reduces spatial dimensions (2x2 pool size)
model.add(MaxPooling2D(pool_size=(2, 2)))

# Summary of the model
model.summary()
```
Key Takeaways
- Filters in Conv2D are applied simultaneously, with each filter producing one feature map.
- Strides control how far the filter moves with each step, influencing the size of the output.
- Padding determines whether or not the edges of the input are included in the convolution.
Conv Layers & Pooling Layers: Strides & Padding - An Analytical Approach
1. Output Dimensions for Convolutional Layers
Case 1: When Padding Is Present (Stride = 1)
The formula to compute the output dimensions:
$O = W - F + 2P + 1$
Where:
- $W$ = Input size (height or width).
- $P$ = Padding size (number of pixels added to the edges of the input).
- $F$ = Filter (kernel) size.
- $S = 1$ (stride is 1).
Case 2: When Padding and Stride Are Present
The formula becomes:
$O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1$
Where:
- $P$ = Padding size.
- $F$ = Filter size.
- $S$ = Stride (step size of the filter).
Case 3: When Only Stride Is Present (No Padding)
If $P = 0$ (no padding), the formula simplifies to:
$O = \left\lfloor \frac{W - F}{S} \right\rfloor + 1$
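All three cases are instances of one general formula, so a small helper function (hypothetical, just for checking exercises by hand) covers them all:

```python
import math

def conv_output_size(w, f, p=0, s=1):
    """General output-size formula: floor((W - F + 2P) / S) + 1."""
    return math.floor((w - f + 2 * p) / s) + 1

print(conv_output_size(28, 3, p=1, s=1))  # Case 1: padding, stride 1 -> 28
print(conv_output_size(28, 3, p=1, s=2))  # Case 2: padding + stride  -> 14
print(conv_output_size(28, 3, p=0, s=2))  # Case 3: stride only       -> 13
```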
2. Output Dimensions for Pooling Layers
For both Max Pooling and Average Pooling, the output dimensions depend on the input size $W$, the pooling window size $F$, the stride $S$, and the padding $P$.
Cases:
- Padding Is Present (Same Pooling):
  - If 'same' padding is applied with stride 1, the spatial dimensions are maintained: $O = W$.
- Padding + Stride:
  - Use the general formula: $O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1$
- Stride Only (No Padding):
  - The formula simplifies to: $O = \left\lfloor \frac{W - F}{S} \right\rfloor + 1$
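The same arithmetic applies to pooling; a short sketch, assuming the common convention (as in Keras) that the pooling stride defaults to the window size:

```python
import math

def pool_output_size(w, f, s=None, p=0):
    """Pooling reuses the convolution formula; stride defaults to the window size."""
    s = f if s is None else s
    return math.floor((w - f + 2 * p) / s) + 1

print(pool_output_size(26, 2))       # 2x2 max pooling, stride 2 -> 13
print(pool_output_size(28, 3, s=1))  # 3x3 pooling, stride 1     -> 26
```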
3. Example: Moving from 7×7×1000 to 1×1×1000 Using Average Pooling
To achieve this, you can use Global Average Pooling. Here's how it works:
- The pool size matches the input size (7×7).
- Stride and padding are irrelevant, since we collapse everything into a 1×1×1000 output.
Python Code Example:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import AveragePooling2D, GlobalAveragePooling2D

# Sequential model
model = Sequential()

# Input shape is 7x7x1000
input_shape = (7, 7, 1000)

# Add Global Average Pooling layer
model.add(GlobalAveragePooling2D(input_shape=input_shape))

# Show summary
model.summary()
```
Explanation: `GlobalAveragePooling2D` computes the average over the entire 7×7 region for each of the 1000 channels, resulting in a 1000-dimensional output (one value per channel).
Alternatively, you can explicitly use AveragePooling2D:
```python
model.add(AveragePooling2D(pool_size=(7, 7), strides=(1, 1), input_shape=(7, 7, 1000)))
```
4. Best Practices for Computation Exercises and Choosing Parameters
When Solving Computation Problems:
- Start With the Input Dimensions:
  - Clearly write down the input size, filter size, padding, and stride for each layer.
  - Apply the formulas step by step (use a table if needed).
- Break the Problem Into Stages:
  - Compute the output dimensions for each layer sequentially.
  - Write intermediate results for clarity.
- Verify Edge Cases:
  - Check how padding affects small inputs.
  - Confirm no fractional dimensions exist after applying the formulas.
- Remember Default Values:
  - Padding defaults to `'valid'` (no padding) unless specified.
  - Stride defaults to 1.
When Choosing Parameters in Code:
- Filter Size:
  - Use small filters (3×3 or 5×5) for most tasks.
  - Smaller filters capture fine details and stack well for hierarchical feature extraction.
- Padding:
  - Use `'same'` padding if you want to preserve spatial dimensions.
  - Use `'valid'` padding for smaller outputs and reduced computation.
- Stride:
  - The default stride (1) gives better granularity.
  - Larger strides reduce computation but may lose detail.
- Pooling:
  - Use Max Pooling for feature extraction (reduces noise).
  - Use Average Pooling for global aggregation (e.g., transitioning to dense layers).
- Batch Size:
  - Start with 32 or 64, and adjust based on memory constraints.
- Learning Rate:
  - Start small (e.g., 0.001) and use learning rate schedulers for adjustments; see the sketch below.
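As a sketch of that last point (the values are illustrative; `ReduceLROnPlateau` is one of several built-in Keras callbacks for adjusting the learning rate):

```python
import tensorflow as tf

# Start with a small learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# Halve the learning rate when validation loss stops improving
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, min_lr=1e-5)

# Then pass both to training, e.g.:
# model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=20, batch_size=64,
#           validation_split=0.2, callbacks=[lr_schedule])
```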