Deep Learning — CNN
Convolutional Neural Networks (CNNs) are a class of deep learning architectures designed specifically for processing image and video data. They are built around convolution, a mathematical operation that extracts features from images. CNNs are composed of a series of layers, each extracting increasingly complex features from the input, which makes them well suited for tasks such as image classification, object detection, and image generation.
A CNN typically consists of several layers (a minimal code sketch follows this list), including:
- Convolutional layers: These layers are the building blocks of CNNs and are responsible for extracting features from the input data. They use a set of filters, also known as kernels, to scan the input image and extract features such as edges, textures, and patterns. The filters are learned during the training process, and the output of the convolutional layer is called a feature map.
- Pooling layers: Pooling layers down-sample the feature maps, reducing their spatial dimensions. This is done by taking the maximum or average value over a small window of pixels. The main role of pooling is to reduce the computational cost and to make the feature representation more robust to small translations of the image.
- Fully connected layers: Fully connected layers are used to make the final decision based on the extracted features. They are similar to traditional multi-layer perceptrons (MLPs), and their role is to classify the input data into different categories. The output of the fully connected layer is passed through a softmax function, which produces a probability distribution over the different classes.
- Normalization layers: Normalization layers can be used in several places within a CNN; their role is to normalize the output of a layer so that the input to the next layer is more stable. Common types include Batch Normalization and Layer Normalization.
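To make the layer roles concrete, here is a minimal sketch of such a network in PyTorch. The channel counts, layer sizes, and the 32 x 32 input resolution are illustrative assumptions, not values from this article:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: two conv -> norm -> pool blocks, then a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: 16 learned filters
            nn.BatchNorm2d(16),                          # normalization layer: stabilizes the next layer's input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: halves the spatial dimensions
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer (assumes 32x32 inputs)

    def forward(self, x):
        x = self.features(x)        # extract feature maps
        x = x.flatten(1)            # flatten for the fully connected layer
        return self.classifier(x)   # logits; softmax turns these into class probabilities

logits = SimpleCNN()(torch.randn(1, 3, 32, 32))  # one random 32x32 RGB "image"
probs = torch.softmax(logits, dim=1)             # probability distribution over the classes
```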
A CNN is trained by minimizing a loss function, in most cases the cross-entropy loss, using an optimization algorithm such as Stochastic Gradient Descent (SGD) or Adam; a rough sketch of such a loop follows.
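One pass over the data with cross-entropy loss and Adam might look like this in PyTorch, reusing the SimpleCNN sketch above. Note that `dataloader` is a hypothetical loader yielding (image batch, label batch) pairs, not something defined in this article:

```python
import torch.nn as nn
import torch.optim as optim

model = SimpleCNN()                  # the sketch above
criterion = nn.CrossEntropyLoss()    # cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for images, labels in dataloader:    # hypothetical loader of (image batch, label batch)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # how far the predictions are from the targets
    loss.backward()                          # backpropagate gradients through the network
    optimizer.step()                         # update the weights
```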
CNNs have shown remarkable results in a wide range of computer vision tasks, such as image classification, object detection, semantic segmentation, and image generation. They have been used to achieve state-of-the-art results on datasets such as ImageNet, COCO, and others.
In summary, CNNs are deep learning architectures designed for image and video data. They are composed of several layers: convolutional layers that extract features from the input, pooling layers that reduce the spatial dimensions of the data, fully connected layers that make the final decision based on the extracted features, and normalization layers that help stabilize training. CNNs have achieved state-of-the-art results across a wide range of computer vision tasks and have become a key method in the field.
Convolution Output
In a CNN we feed in the entire image, so every image in the dataset passes through the network. What happens when an image enters the network? Each neuron is connected to only a small region of pixels; this region is called the receptive field of that neuron, and many such neurons together cover the entire image. Consider a 3 x 3 patch of the image connected to a single neuron: each of the nine pixels is connected through its own weight, so the first pixel is connected through the first weight, the second pixel through the second weight, and so on for all nine. When we convolve this 3 x 3 patch with the filter weights, we get a single number: convolving means multiplying the filter weights with the pixels element by element and then adding up the products. That number is the output of the neuron, as the NumPy example below shows.
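Here is that multiply-and-add step for a single receptive field, written out in NumPy; the patch and filter values are made up purely for illustration:

```python
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])     # 3x3 receptive field (illustrative values)
weights = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]])  # 3x3 filter weights (illustrative values)

output = np.sum(patch * weights)  # multiply element by element, then add the nine products
print(output)                     # prints 1: this single number is the neuron's output
```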
To look at another receptive field, the filter shifts to the next position; the step size of this shift is called the stride (here, stride = 1). Each shift corresponds to another neuron looking at another receptive field, so the number of positions the filter visits equals the number of neurons viewing the image. Please note that the same set of weights moves across the entire image, as the sketch below illustrates.
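A minimal sketch of one shared filter sliding over a 5 x 5 image with stride 1; the image and weight values are again illustrative:

```python
import numpy as np

image = np.arange(25).reshape(5, 5)  # toy 5x5 "image" (illustrative values)
weights = np.ones((3, 3))            # a single shared 3x3 filter
stride = 1

n = (5 - 3) // stride + 1            # number of positions per axis: 3
out = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # the SAME weights are applied at every receptive field
        patch = image[i*stride:i*stride+3, j*stride:j*stride+3]
        out[i, j] = np.sum(patch * weights)
print(out.shape)                     # (3, 3): one output per receptive field
```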
Max Pooling Operation
- Pooling layers reduce the spatial size of the feature maps when images are large, which in turn reduces the computation and the number of parameters in later layers.
- Spatial pooling, also called subsampling, reduces the dimensionality of each feature map but retains the important information (see the NumPy sketch below).
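For instance, 2 x 2 max pooling with stride 2 keeps only the largest value in each window; here is a small NumPy sketch with made-up values:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 4, 1, 8]])  # 4x4 feature map (illustrative values)

# 2x2 max pooling with stride 2: keep the maximum of each non-overlapping 2x2 window
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]
```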
Image: 5 x 5
Filter: 3 x 3
Stride: 1
Zero padding: none
Output dimension:
[(W - F + 2P)/S] + 1 = [(5 - 3 + 0)/1] + 1 = 3, so the output is 3 x 3
Image: 5 x 5
Filter: 3 x 3
Stride: 2
Zero padding: none
Output dimension:
[(W - F + 2P)/S] + 1 = [(5 - 3 + 0)/2] + 1 = 2, so the output is 2 x 2 (both cases are verified in the snippet below)
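A quick way to sanity-check both cases is to evaluate the formula directly (plain Python, using integer division since the sizes divide evenly):

```python
def conv_output_size(w, f, p, s):
    """One spatial dimension of the output: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(5, 3, 0, 1))  # 3 -> 3 x 3 output
print(conv_output_size(5, 3, 0, 2))  # 2 -> 2 x 2 output
```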