Convolutional Neural Network
Created on Aug 07, 2025, Last Updated on Aug 25, 2025, By a Developer
Concepts
Convolution Layer
For a single-channel convolution layer, the layer contains a filter of size
Given input of size
Taking channels into consideration, if the input has
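As a minimal sketch of the layer described above, the following NumPy code slides one filter over a multi-channel input. The function name `conv2d_valid` and the example shapes are illustrative assumptions, not taken from these notes. It assumes stride 1 and no padding, in which case each spatial side of the output has length n − f + 1 for input side n and filter side f.

```python
import numpy as np

def conv2d_valid(x, w):
    """Cross-correlate input x (H, W, C) with one filter w (f, f, C).

    Assumes stride 1 and no padding, so the output is (H-f+1, W-f+1).
    """
    H, W, C = x.shape
    f = w.shape[0]
    assert w.shape == (f, f, C), "filter must span every input channel"
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product over the f x f x C window, then sum.
            out[i, j] = np.sum(x[i:i + f, j:j + f, :] * w)
    return out

x = np.ones((6, 6, 3))   # hypothetical 6x6 input with 3 channels
w = np.ones((3, 3, 3))   # one 3x3 filter covering all 3 channels
y = conv2d_valid(x, w)   # one filter yields one output channel
```

Stacking the outputs of several such filters along a new axis would give one output channel per filter.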
Pooling Layer
A Pooling Layer has no learnable parameters; all it does is shrink the size of the volume. For example, it shrinks an input of size
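To illustrate the shrinking, here is a rough max-pooling sketch in NumPy. The window size of 2 and stride of 2 are assumed defaults chosen for the example, and `max_pool2d` is a hypothetical helper name.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max-pool x (H, W, C) over size x size windows; nothing is learned."""
    H, W, C = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, C), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Keep the largest value of each channel inside the window.
            out[i, j, :] = x[r:r + size, c:c + size, :].max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)  # 4x4 input, 1 channel
pooled = max_pool2d(x)                           # 2x2 output, 1 channel
```

The channel count is unchanged; only the height and width shrink.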
Inception Module
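An Inception Module applies Convolution Layers with different filter sizes to the same input in parallel and concatenates their outputs along the channel axis.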
Projection
Projection is another name for a convolution layer with a filter size of 1. A single such filter flattens the input along the channel axis into one channel.
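A projection can be sketched as a per-pixel weighted sum over channels. In the NumPy example below, `project` and the weight layout `(C_in, C_out)` are assumptions made for illustration; with `C_out = 1` the channel axis collapses to a single channel.

```python
import numpy as np

def project(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    # Each output pixel is a weighted sum of that pixel's input channels.
    return np.einsum("hwc,co->hwo", x, w)

x = np.random.default_rng(0).normal(size=(5, 5, 8))
w_single = np.full((8, 1), 1 / 8)   # one filter that averages the 8 channels
flat = project(x, w_single)         # shape (5, 5, 1)
```

Passing a weight matrix with more columns would project the input to that many channels instead of one.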
Transpose Convolution Layer
A Transpose Convolution Layer works in the opposite direction to a Convolution Layer. It takes one value from the input, multiplies it with the whole filter, and repeats this for every input cell. The resulting patches are overlaid on the output, summed wherever they overlap, which gives an output larger than the input.
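The overlay procedure can be sketched in a few lines of NumPy. This is a single-channel illustration with no padding; the stride of 2 and the name `conv_transpose2d` are assumptions for the example. Under these assumptions each output side has length (n − 1) · stride + f.

```python
import numpy as np

def conv_transpose2d(x, w, stride=2):
    """Transpose convolution of x (H, W) with filter w (f, f), no padding."""
    H, W = x.shape
    f = w.shape[0]
    out = np.zeros(((H - 1) * stride + f, (W - 1) * stride + f))
    for i in range(H):
        for j in range(W):
            r, c = i * stride, j * stride
            # Scale the whole filter by one input value and add it onto
            # the output; overlapping patches accumulate.
            out[r:r + f, c:c + f] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.ones((3, 3))
up = conv_transpose2d(x, w)   # 2x2 input becomes a 5x5 output
```

The center cell `up[2, 2]` is covered by all four patches, so it holds the sum of all four input values.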
Image Tasks
Object Detection
Object Detection takes convolutional neural networks a step further. Besides classifying the image into categories, the model also identifies the position of the object inside the image by outputting a bounding box.
To count a prediction as correct, we cannot expect the predicted and labeled bounding boxes to be exactly the same. IoU (Intersection over Union) is usually calculated to measure how well they match.
Historically, it was common to chunk the image into small pieces (of different sizes and overlapping with each other) and run the model on each piece separately to get better prediction accuracy. It is also fairly common to have one model predict the positions of multiple objects. One challenge is deciding which boxes to keep and which to ignore when multiple predictions are valid. For boxes that overlap each other, IoU is used again to decide whether two bounding boxes describe the same object.
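Both uses of IoU can be sketched in plain Python. The box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the greedy keep-the-best-score strategy (commonly called non-max suppression) are assumptions for this example rather than details given in these notes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) < iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is treated as the same object and dropped, while the third box is kept.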
Image Segmentation
Image Segmentation is a type of image task that takes Object Detection to the next level. Instead of just identifying the bounding box of a detected object, it identifies the class label of each pixel. The output has the same height and width as the input, and each output value indicates the class the corresponding pixel belongs to.
Image Style
Image Style detection is a by-product of a CNN image model. When looking at the hidden layers of the neural network, each channel of a layer represents some aspect of the image, such as color, texture, or strokes. One way to understand style is as the correlation between channels.
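One common way to make "correlation between channels" concrete is a Gram matrix over a layer's activations, as used in neural style transfer. These notes do not name it, so treat the choice, the `(H, W, C)` layout, and the `gram_matrix` helper as assumptions.

```python
import numpy as np

def gram_matrix(activations):
    """Channel-by-channel correlation of activations with shape (H, W, C)."""
    H, W, C = activations.shape
    flat = activations.reshape(H * W, C)   # one row per spatial position
    # Entry [i, j] sums the product of channels i and j over all positions.
    return flat.T @ flat                   # shape (C, C)

a = np.random.default_rng(0).normal(size=(4, 4, 3))  # stand-in activations
G = gram_matrix(a)
```

A large entry `G[i, j]` means channels `i` and `j` tend to activate at the same positions, which is one way to read "style" off a layer.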
Face Recognition
Face Recognition is a One Shot Learning task, meaning the production input won't be part of the training set; the model only has one chance to see it. So the task is essentially to find the degree of similarity between two images in terms of identity, somewhat similar to Semantic Similarity in language tasks.
The model takes an image as input and outputs an embedding that represents the identity. A similarity metric (Cosine Similarity, Euclidean Distance) is then applied to two embeddings to find how similar the two images are.
During training, the model is trained on three inputs at once: an anchor input, a positive input, and a negative input. The anchor should be closer to the positive than to the negative. The loss function follows directly from this.
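A loss of this kind is usually called a triplet loss. The sketch below uses squared Euclidean distance and a margin of 0.2; both are assumptions chosen for illustration, and the embeddings are made-up 2-D vectors.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared Euclidean distances between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Zero once the negative is farther than the positive by `margin`.
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # same identity: close to the anchor
negative = np.array([-1.0, 0.0])   # different identity: far from the anchor

easy = triplet_loss(anchor, positive, negative)  # already well separated
hard = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```

The well-separated triplet contributes no loss, while the swapped one is penalized, which pushes the model to pull anchors toward positives and away from negatives.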
Case Study
MobileNet
The key innovation of MobileNet compared with a normal CNN is that it drastically reduces the computation required without losing too much prediction accuracy.
It achieves this by breaking a normal convolution layer into two steps. It first applies one filter per input channel without summing across channels, and then applies a projection to flatten the result into one channel.
A normal convolution layer costs
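As a rough way to compare the two designs, the sketch below counts multiplications for each. The counting convention (one multiplication per filter weight per output position) and the example layer dimensions are assumptions, not figures from these notes.

```python
def standard_conv_mults(out_h, out_w, f, c_in, c_out):
    """Multiplications for a standard f x f convolution."""
    return out_h * out_w * f * f * c_in * c_out

def depthwise_separable_mults(out_h, out_w, f, c_in, c_out):
    """Depthwise step (one f x f filter per input channel) plus 1x1 projection."""
    depthwise = out_h * out_w * f * f * c_in
    pointwise = out_h * out_w * c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 4x4 output, 3x3 filters, 3 input and 5 output channels.
normal = standard_conv_mults(4, 4, 3, 3, 5)
separable = depthwise_separable_mults(4, 4, 3, 3, 5)
ratio = separable / normal   # equals 1/c_out + 1/f**2 under this counting
```

Under this counting the ratio simplifies to 1/c_out + 1/f², so the saving grows with the number of output channels and the filter size.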
YOLO
YOLO stands for You Only Look Once. Its major innovation is just what the name says: it looks at each pixel in the image only once, unlike the historical approach mentioned in Object Detection.
It logically chunks the image into
U-Net
U-Net is an implementation of Image Segmentation. The model has two main parts: the Contracting Path and the Expanding Path.
- The Contracting Path repeats blocks of Convolution → ReLU → Convolution → ReLU → MaxPooling. Each block halves the feature size and doubles the number of channels.
- The Expanding Path repeats blocks of Convolution → ReLU → Convolution → ReLU → Transpose Convolution. Each block doubles the feature size and halves the number of channels.
- The feature size of the nth Contracting Path layer is the same as that of the nth-from-last Expanding Path layer, so we add a skip connection between them.
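The halving and doubling can be traced with a small shape-only sketch. It is a simplification under stated assumptions: convolutions are taken to preserve the feature size, only pooling and transpose convolution change it, and the extra channels that concatenating a skip connection would add are ignored. The starting size of 64, 16 channels, and depth of 3 are made-up values.

```python
def unet_shapes(size, channels, depth):
    """Trace (feature size, channels) through a U-Net-style network."""
    down = []
    for _ in range(depth):
        down.append((size, channels))      # saved for the skip connection
        size, channels = size // 2, channels * 2
    bottom = (size, channels)
    up = []
    for skip in reversed(down):
        size, channels = size * 2, channels // 2
        # Shapes line up with the matching contracting layer.
        assert (size, channels) == skip
        up.append((size, channels))
    return down, bottom, up

down, bottom, up = unet_shapes(size=64, channels=16, depth=3)
```

The expanding shapes come out as the contracting shapes in reverse order, which is what makes the pairwise skip connections possible.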