What is CNN?

Before we get to different types of CNN architecture, let's quickly recall what a CNN is, what a CNN model looks like, and what the most fundamental components of a CNN architecture are.

Convolutional Neural Networks, commonly referred to as CNNs, are a specialized kind of neural network architecture that is designed to process data with a grid-like topology. This makes them particularly well-suited for dealing with spatial and temporal data, like images and videos, that maintain a high degree of correlation between adjacent elements.

CNNs are similar to other neural networks, but they add a layer of complexity through their use of a series of convolutional layers. Convolutional layers perform a mathematical operation called convolution, a sort of specialized matrix multiplication, on the input data. The convolution operation helps preserve the spatial relationship between pixels by learning image features using small squares of input data. The picture below represents a typical CNN architecture.

Typical CNN architecture

The following are definitions of the different layers shown in the above architecture:

Convolutional layers

Operate by sliding a set of ‘filters’ or ‘kernels’ across the input data. Each filter is designed to detect a specific feature or pattern, such as edges, corners, or more complex shapes in the case of deeper layers. As these filters move across the image, they generate a map that signifies the areas where those features were found. The output of the convolutional layer is a feature map, which is a representation of the input image with the filters applied. Convolutional layers can be stacked to create more complex models, which can learn more intricate features from images. Simply speaking, convolutional layers are responsible for extracting features from the input images. These features might include edges, corners, textures, or more complex patterns.
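To make the sliding-filter idea concrete, here is a minimal NumPy sketch (our own illustration, not the Keras API) of a single hand-crafted 3×3 filter producing one feature map. In a real CNN the filter values are learned during training rather than written by hand:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid convolution, stride 1)
    and return the resulting feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the current image patch
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(28, 28)            # toy grayscale "image"
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])    # responds strongly to vertical edges
print(convolve2d(image, vertical_edge).shape)  # (26, 26)
```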

Pooling layers

Follow the convolutional layers and are used to reduce the spatial dimensions of the input, making it easier to process and requiring less memory. In the context of images, “spatial dimensions” refer to the width and height of the image. An image is made up of pixels, and you can think of it like a grid, with rows and columns of tiny squares (pixels). By reducing the spatial dimensions, pooling layers help reduce the number of parameters or weights in the network. This helps combat overfitting and speeds up training. Max pooling reduces computational complexity by shrinking the feature map, and it makes the model invariant to small translations. Without max pooling, the network would not gain the ability to recognize features irrespective of small shifts or rotations, making the model less robust to variations in object positioning within the image and possibly hurting accuracy.

There are two main types of pooling: max pooling and average pooling. Max pooling takes the maximum value from each feature map. For example, if the pooling window size is 2×2, it will pick the pixel with the highest value in that 2×2 region. Max pooling effectively captures the most prominent feature or characteristic within the pooling window. Average pooling calculates the average of all values within the pooling window. It provides a smooth, averaged feature representation.
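The difference is easy to see on a toy feature map. Here is a minimal NumPy sketch (values chosen purely for illustration) applying a 2×2 window with stride 2:

```python
import numpy as np

# A toy 4x4 feature map
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 0, 8]], dtype=float)

# Split into non-overlapping 2x2 windows, then reduce each window
windows = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(windows.max(axis=-1))   # max pooling     -> [[6. 4.]
                              #                     [7. 9.]]
print(windows.mean(axis=-1))  # average pooling -> [[3.75 2.25]
                              #                     [4.   4.5 ]]
```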

Fully-connected layers

One of the most basic types of layers in a convolutional neural network (CNN). As the name suggests, each neuron in a fully-connected layer is connected to every neuron in the previous layer. Fully connected layers are typically used towards the end of a CNN, when the goal is to take the features learned by the convolutional and max pooling layers and use them to make predictions, such as classifying the input to a label. For example, if we were using a CNN to classify images of animals, the final fully connected layer might take the features learned by the previous layers and use them to classify an image as containing a dog, cat, bird, etc.

Fully connected layers take the high-dimensional output from the previous convolutional and pooling layers and flatten it into a one-dimensional vector. This allows the network to combine and integrate all the extracted features across the entire image, rather than considering localized features. It helps in understanding the global context of the image. The fully connected layers are responsible for mapping the integrated features to the desired output, such as class labels in classification tasks. They act as the final decision-making part of the network, determining what the extracted features mean in the context of the specific problem (e.g., recognizing a cat or a dog).
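As a minimal Keras sketch of this final “flatten and predict” stage (the input shape and layer sizes here are illustrative assumptions, not values from a specific model):

```python
from tensorflow.keras import layers, models

head = models.Sequential([
    layers.Input(shape=(7, 7, 64)),          # e.g., output of the last pooling layer
    layers.Flatten(),                        # 7 * 7 * 64 -> 3136-dimensional vector
    layers.Dense(128, activation='relu'),    # combine features across the whole image
    layers.Dense(10, activation='softmax'),  # map to 10 hypothetical class probabilities
])
head.summary()
```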

The combination of a convolutional layer followed by a max-pooling layer, repeated several times, creates a hierarchy of features. The first layer detects simple patterns, and subsequent layers build on those to detect more complex patterns.

The output layer

In a Convolutional Neural Network (CNN), the output layer plays a critical role as the final layer that produces the actual output of the network, typically in the form of a classification or regression result. Its importance can be outlined as follows:

Transformation of Features to Final Output:

The earlier layers of the CNN (convolutional, pooling, and fully connected layers) are responsible for extracting and transforming features from the input data. The output layer takes these high-level, abstracted features and transforms them into a final output form, which is directly interpretable in the context of the problem being solved.

Task-Specific Formulation:

For classification tasks, the output layer typically uses a softmax activation function, which converts the input from the previous layers into a probability distribution over the predefined classes. The softmax function ensures that the output probabilities sum to 1, making them directly interpretable as class probabilities.

For regression tasks, the output layer might consist of one or more neurons with linear or no activation function, providing continuous output values.
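For intuition, here is a minimal NumPy sketch of the softmax computation on some hypothetical logits:

```python
import numpy as np

def softmax(z):
    """Turn raw scores (logits) into a probability distribution."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```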

Real-world usage of CNN

CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to identify objects in an image or to classify an image as being a cat or a dog. CNNs can also be used for more complex tasks, such as generating descriptions of an image or identifying the points of interest in an image. Beyond image data, CNNs can also handle time-series data, such as audio data or even text data, although other types of networks like Recurrent Neural Networks (RNNs) or transformers are often preferred for these scenarios. CNNs are a powerful tool for deep learning, and they have been used to achieve state-of-the-art results in many different applications.

A Dive into Function & Code

How CNNs Work

  1. Input Layer: The raw image data is passed as input. For example, a color image has dimensions (height, width, channels), e.g., (224, 224, 3).

  2. Convolutional Layer: Applies convolution operations to extract features like edges, corners, or textures.

    • Mathematical Operation:

      \text{output}[i, j] = \sum_k \sum_{m,n} \text{input}[i+m, j+n, k] \cdot \text{filter}[m, n, k] + \text{bias}

  3. Activation Function (ReLU): Introduces non-linearity by applying ReLU(x) = max(0, x).

  4. Pooling Layer: Reduces the spatial dimensions (height and width) while retaining important features.
    Common methods:

    • Max Pooling: Takes the maximum value in a window.
    • Average Pooling: Takes the average value in a window.

  5. Fully Connected Layer (Dense Layer): Connects all neurons to make predictions.

  6. Softmax/Output Layer: Outputs probabilities for classification.


Flow of a CNN

  • Input Image
    → Convolution (Extract Features)
    → ReLU (Non-Linearity)
    → Pooling (Downsample)
    → Flatten (Convert to 1D)
    → Fully Connected Layers
    → Output

Python Example: CNN with TensorFlow/Keras

Implementing a CNN for image classification using the MNIST dataset.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocess data
x_train = x_train.reshape((x_train.shape[0], 28, 28, 1)) / 255.0  # Normalize and add channel dimension
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1)) / 255.0
y_train = to_categorical(y_train)  # Convert labels to one-hot encoding
y_test = to_categorical(y_test)

# Build CNN model
model = models.Sequential()

# 1. Convolutional Layer
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Parameters explained:
# - 32: Number of filters
# - (3, 3): Size of the filter/kernel
# - activation='relu': Non-linear activation
# - input_shape: Shape of input data

# 2. Pooling Layer
model.add(layers.MaxPooling2D((2, 2)))
# Parameters explained:
# - (2, 2): Pooling window size (reduces dimensions by half)

# 3. Another Convolutional Layer
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# 4. Another Pooling Layer
model.add(layers.MaxPooling2D((2, 2)))

# 5. Flatten the output to feed into Dense layers
model.add(layers.Flatten())

# 6. Fully Connected Layer
model.add(layers.Dense(64, activation='relu'))

# 7. Output Layer (10 classes for digits 0-9)
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Optimizer: Adam for gradient optimization
# Loss: Categorical cross-entropy for multi-class classification
# Metrics: Accuracy for evaluation

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc}")

Key Concepts

  1. Convolutional Layer:

    • Detects patterns such as edges, textures, or shapes using small filters.
    • Filters are learned during training.
  2. ReLU Activation:

    • Ensures non-linearity, enabling the model to learn complex features.
  3. Pooling Layer:

    • Reduces the spatial size, improving computation efficiency and reducing overfitting.
  4. Flatten Layer:

    • Converts the feature map into a 1D vector to pass into fully connected layers.
  5. Dense Layer:

    • Makes predictions based on the extracted features.

Visualization of CNN Architecture

  • Input Image (28x28x1)
    → Conv2D (32 filters, 3x3) → ReLU
    → MaxPooling (2x2)
    → Conv2D (64 filters, 3x3) → ReLU
    → MaxPooling (2x2)
    → Flatten
    → Dense (64 units) → ReLU
    → Dense (10 units) → Softmax

Practical Notes

  1. Filter Size: Commonly (3x3) or (5x5) for feature extraction.
  2. Pooling Size: Typically (2x2) for downsampling.
  3. ReLU: Avoids the vanishing gradient problem compared to sigmoid/tanh.
  4. Batch Size and Epochs: Control training speed and model convergence.

Summary Table

| Layer         | Purpose                                  | Parameters                               |
|---------------|------------------------------------------|------------------------------------------|
| Convolutional | Feature extraction                       | Filters, kernel size, stride, padding    |
| ReLU          | Non-linearity                            | None                                     |
| Pooling       | Downsampling                             | Pooling type (Max/Average), window size  |
| Flatten       | Convert to 1D                            | None                                     |
| Dense         | Fully connected for prediction           | Number of neurons, activation function   |
| Softmax       | Output probabilities for classification  | None                                     |

How does Conv2D with multiple filters work in one convolutional layer?

What Happens in Conv2D?

  • When you specify Conv2D(32, (3, 3)), it means:
    • There are 32 filters (kernels) in this layer.
    • Each filter has a size of 3x3.
    • These filters will be applied to the input simultaneously, not one after the other.

Step-by-Step Explanation

  1. Input Dimensions: Let's say the input to the Conv2D layer is an image with shape (28, 28, 1) (height, width, channels). Here, 1 represents a grayscale image with 1 channel.

  2. Filter Application:

    • All 32 filters (each 3x3 in size) are applied to the input image at the same time.
    • Each filter slides over the input (using strides) and performs the convolution operation (dot product) at every location. The result of applying one filter is called a feature map.
  3. Output (Feature Maps):

    • After applying 32 filters, you get 32 feature maps, one for each filter.
    • If no padding is used, the dimensions of each feature map will be smaller than the input, calculated as:
\text{Output Height} = \text{Input Height} - \text{Filter Height} + 1

\text{Output Width} = \text{Input Width} - \text{Filter Width} + 1

For our example:

\text{Output Shape} = (28 - 3 + 1, 28 - 3 + 1, 32) = (26, 26, 32)

  4. ReLU Activation:

    • Once all 32 feature maps are generated, the ReLU activation function is applied to each of them. This operation introduces non-linearity by replacing all negative values with 0.
  5. Pooling:

    • After ReLU, the feature maps are passed to the Pooling layer. Pooling reduces the spatial dimensions (e.g., from 26×26 to 13×13) while keeping the number of feature maps (32) the same. A quick Keras shape check follows below.
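Here is the shape check mentioned above, verifying the 28 → 26 → 13 progression directly in Keras:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),  # -> (26, 26, 32)
    layers.MaxPooling2D((2, 2)),                   # -> (13, 13, 32)
])
model.summary()  # the printed output shapes confirm the arithmetic above
```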

Key Points to Remember

  • Filters are applied simultaneously, not sequentially.
  • Each filter extracts a specific feature (e.g., edges, corners, textures) from the input.
  • The depth of the output (number of feature maps) equals the number of filters.
  • ReLU is applied after convolution to introduce non-linearity.
  • Pooling reduces spatial dimensions but does not change the depth.

Example for Better Visualization

Let’s take an input image with dimensions 28×28×1:

  • Conv2D Layer: Conv2D(32, (3, 3))

    • 32 filters (3x3) are applied to the input, producing 26×26×32 feature maps.
    • ReLU activation is applied to these feature maps.
  • Pooling Layer: MaxPooling2D((2, 2))

    • Max-pooling reduces the spatial size to 13×13×32.

Why This Design?

  • Parallel filters enable CNNs to learn multiple features (e.g., horizontal edges, vertical edges, textures) at the same time.
  • This helps the network learn hierarchical features:
    • Lower layers detect basic patterns (e.g., edges).
    • Higher layers detect complex patterns (e.g., shapes or objects).

Let’s dive deeper into strides and padding, and explain what it means when filters are applied at the same time in Conv2D.


What Happens When Filters Are Applied "At the Same Time"?

When we say all filters are applied at the same time, we mean:

  • Each filter operates on the same input region of the image independently.
  • Filters do not conflict with each other because each filter is processing the input in parallel. There’s no overlap or interference between filters.
  • For each filter, the result of sliding it over the entire image (convolution operation) produces one feature map.

Think of it like running 32 workers (one per filter) to process the same image simultaneously.


Strides: What Are They?

The stride determines how much the filter "jumps" or "shifts" when it slides across the image.

  • Default Stride (1):
    The filter moves one pixel at a time (both horizontally and vertically). This ensures maximum overlap between adjacent receptive fields (regions the filter covers).

  • Larger Stride (>1):
    The filter skips pixels as it slides. For example, with a stride of 2, the filter moves 2 pixels at a time. This results in:

    • Fewer computations.
    • A smaller output feature map (reduced spatial dimensions).

Stride Formula:

The output size with stride S is:

\text{Output Size} = \left\lfloor \frac{\text{Input Size} - \text{Filter Size}}{\text{Stride}} \right\rfloor + 1

For example:

  • Input: 28×28×1
  • Filter: 3×3
  • Stride: 2

\text{Output Size} = \left\lfloor \frac{28 - 3}{2} \right\rfloor + 1 = 13 \quad (\text{a } 13 \times 13 \text{ feature map})

Padding: What Is It?

Padding refers to adding extra pixels around the edges of the input image. This is used to control the size of the output feature map.

  1. Valid Padding (No Padding):

    • No extra pixels are added.
    • The filter only slides over the valid region of the input.
    • Results in a smaller feature map.
    • Example: a 28×28 image with 3×3 filters → a 26×26 feature map.
  2. Same Padding (Zero Padding):

    • Adds zero-pixels around the image edges to ensure the output has the same spatial dimensions as the input.
    • Formula for padding width:

      \text{Padding} = \frac{\text{Filter Size} - 1}{2}

  • Example: a 28×28 image with 3×3 filters → a 28×28 feature map.

How Do Strides and Padding Apply in Our Case?

Conv2D(32, (3, 3), strides=1, padding='valid')
  • Strides = 1:

    • The filter moves one pixel at a time (maximum overlap between regions).
    • The output feature map is 26×26 (calculated using the formula without padding).
  • Padding = 'valid':

    • No padding is added; the filter only visits positions where it fits entirely inside the image, so the output shrinks relative to the input.

If we change to:

Conv2D(32, (3, 3), strides=2, padding='same')
  • Strides = 2:

    • The filter moves 2 pixels at a time, reducing spatial dimensions (fewer overlaps).
    • The output feature map becomes 14×14.
  • Padding = 'same':

    • Zero-padding is added to maintain the original spatial dimensions where possible.

Visualizing the Process

Let’s break down how filters work with strides and padding:

Case 1: Strides = 1, Padding = 'valid'

  1. The first filter processes the entire image and produces its own 26×26 feature map.
  2. The second filter does the same, producing another 26×26 feature map.
  3. This happens for all 32 filters, resulting in 26×26×32.

Case 2: Strides = 2, Padding = 'same'

  1. The first filter skips every other pixel, producing a smaller 14×14 feature map.
  2. The second filter does the same, producing another 14×14 feature map.
  3. Again, this happens for all 32 filters, resulting in 14×14×32.

Python Code Example

Here’s a commented Python example using TensorFlow/Keras:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

# Create a simple Sequential model
model = Sequential()

# Add a Conv2D layer
# 32 filters, 3x3 kernel, stride of 1, 'same' padding
model.add(Conv2D(32, (3, 3), strides=1, padding='same', input_shape=(28, 28, 1)))

# Add a MaxPooling layer
# Pooling reduces spatial dimensions (2x2 pool size)
model.add(MaxPooling2D(pool_size=(2, 2)))

# Summary of the model
model.summary()

Key Takeaways

  • Filters in Conv2D are applied simultaneously, with each filter producing one feature map.
  • Strides control how far the filter moves with each step, influencing the size of the output.
  • Padding determines whether or not the edges of the input are included in the convolution.

ConvLayers & PoolingLayers: Strides & Padding - Analytical Approach


1. Output Dimensions for Convolutional Layers

Case 1: When Padding is Present (No Stride)

The formula to compute the output dimensions:

\text{Output Dimensions} = \left(\text{Input Size} + 2P - F\right) + 1

Where:

  • P = Padding size (number of pixels added to the edges of the input).
  • F = Filter (kernel) size.
  • S = 1 (stride is 1).

Case 2: When Padding and Stride Are Present

The formula becomes:

\text{Output Dimensions} = \left\lfloor \frac{\text{Input Size} + 2P - F}{S} \right\rfloor + 1

Where:

  • P = Padding size.
  • F = Filter size.
  • S = Stride (step size of the filter).

Case 3: When Stride Only (No Padding)

If P = 0 (no padding):

\text{Output Dimensions} = \left\lfloor \frac{\text{Input Size} - F}{S} \right\rfloor + 1

2. Output Dimensions for Pooling Layers

For both Max Pooling and Average Pooling, the output dimensions depend on:

\text{Output Dimensions} = \left\lfloor \frac{\text{Input Size} - \text{Pool Size}}{\text{Stride}} \right\rfloor + 1

Cases:

  1. Padding is Present (Same Pooling):
    • With 'same' padding, the input is padded so that the output size depends only on the stride:

      \text{Output Dimensions} = \left\lceil \frac{\text{Input Size}}{\text{Stride}} \right\rceil

  2. Padding + Stride:
    • Use the general formula:

      \text{Output Dimensions} = \left\lfloor \frac{\text{Input Size} + 2P - \text{Pool Size}}{\text{Stride}} \right\rfloor + 1

  3. Stride Only (No Padding):
    • The formula simplifies to:

      \text{Output Dimensions} = \left\lfloor \frac{\text{Input Size} - \text{Pool Size}}{\text{Stride}} \right\rfloor + 1
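All of these cases reduce to the same general formula, so a single helper function covers both convolution and pooling. Here is a minimal sketch (the function and parameter names are ours, not a library API):

```python
import math

def output_size(n, f, stride=1, padding=0):
    """floor((n + 2P - F) / S) + 1 — works for conv and pooling layers alike."""
    return math.floor((n + 2 * padding - f) / stride) + 1

print(output_size(28, 3, stride=1, padding=1))  # 28: 3x3 conv with 'same'-style padding
print(output_size(28, 3, stride=2))             # 13: strided 3x3 conv, no padding
print(output_size(28, 2, stride=2))             # 14: 2x2 max/average pooling, stride 2
```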

3. Example: Moving from 7×7×1000 to 1×1×1000 Using Average Pooling

To achieve this, you can use Global Average Pooling. Here's how it works:

  1. The pool size matches the input size (7×7).
  2. Stride and padding are irrelevant since we collapse everything into a 1×1 output.

Python Code Example:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import AveragePooling2D, GlobalAveragePooling2D

# Sequential model
model = Sequential()

# Input shape is 7x7x1000
input_shape = (7, 7, 1000)

# Add Global Average Pooling layer
model.add(GlobalAveragePooling2D(input_shape=input_shape))

# Show summary
model.summary()

Explanation:

  • GlobalAveragePooling2D averages over the entire 7×7 spatial extent of each of the 1000 channels, producing one value per channel, i.e., a 1×1×1000 output.

Alternatively, you can explicitly use AveragePooling2D:

model.add(AveragePooling2D(pool_size=(7, 7), strides=(1, 1), input_shape=(7, 7, 1000)))

4. Best Practices for Computation Exercises and Choosing Parameters

When Solving Computation Problems:

  1. Start With the Input Dimensions:

    • Clearly write down the input size, filter size, padding, and stride for each layer.
    • Apply the formulas step by step (use a table if needed).
  2. Break the Problem Into Stages:

    • Compute the output dimensions for each layer sequentially.
    • Write intermediate results for clarity.
  3. Verify Edge Cases:

    • Check how padding affects small inputs (e.g., 3×3).
    • Confirm no fractional dimensions exist after applying formulas.
  4. Remember Default Values:

    • Padding defaults to 'valid' (no padding) unless specified.
    • Stride defaults to 1.

When Choosing Parameters in Code:

  1. Filter Size:

    • Use small filters (3×3 or 5×5) for most tasks.
    • Smaller filters capture fine details and stack well for hierarchical feature extraction.
  2. Padding:

    • Use 'same' padding if you want to preserve spatial dimensions.
    • Use 'valid' padding for smaller outputs and reduced computation.
  3. Stride:

    • Default stride (1) gives better granularity.
    • Larger strides reduce computation but may lose detail.
  4. Pooling:

    • Use Max Pooling for feature extraction (reduces noise).
    • Use Average Pooling for global aggregation (e.g., transitioning to dense layers).
  5. Batch Size:

    • Start with 32 or 64, and adjust based on memory constraints.
  6. Learning Rate:

    • Start small (1e-3) and use learning rate schedulers for adjustments (see the sketch below).
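As a sketch of these defaults in Keras (the exact values are starting points to tune, not recommendations for every dataset), assuming a `model` and training data defined as in the earlier MNIST example:

```python
import tensorflow as tf

# Adam starting at 1e-3, plus a scheduler that cuts the learning rate
# when validation loss stops improving
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                 factor=0.1, patience=3)

model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=32,
          validation_split=0.2, callbacks=[reduce_lr])
```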
