Various Computer Vision Architectures — What’s the difference

Vishnu U
5 min read · Nov 27, 2021

Computer vision started as a research topic in the late 1960s. Its aim is to mimic human vision and understanding. The foundation of computer vision is image processing. In the modern era, image processing is combined with AI algorithms to mimic how the human brain works on images. Unlike numerical data, images need to be processed and interpreted in a different manner for the computer to produce any meaningful output. For a long time now, CNNs have been the dominant trend in this area of machine learning, but traditional CNNs have many disadvantages at large scale. To overcome them, many architectures have been designed. In this article, we will look at several popular computer vision architectures, their design and their specialty.

AlexNet

AlexNet was designed in 2012 for large-scale image classification, 1000 classes to be exact. The design starts with 11x11 convolution kernels, moves to 5x5 and narrows down to 3x3. AlexNet contains a total of 8 learned layers: 5 convolutional and 3 fully connected.

https://en.wikipedia.org/wiki/AlexNet

AlexNet achieves good accuracy largely because it was trained on a very large dataset (ImageNet) using GPUs.
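To make the layer progression concrete, here is a minimal PyTorch sketch of an AlexNet-style network. It is illustrative only: the original also used local response normalization and split some convolutions across two GPUs, both of which are omitted here.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Simplified AlexNet-style network: 5 conv layers + 3 fully connected layers."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),  # large 11x11 kernels first
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),           # then 5x5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),          # narrows down to 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),          # fifth and last conv layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(                            # 3 fully connected layers
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 224x224 RGB image comes out as a vector of 1000 class scores.
print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```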

ResNet

Very deep networks can give good results, but beyond a certain depth they run into the vanishing gradient problem. Because backpropagation starts from the output and the gradient is multiplied through layer after layer on its way back, it keeps getting smaller as it reaches the earlier layers of the network. ResNet is a large-scale deep neural network that uses skip connections to prevent vanishing gradients and increase the overall performance of the network. A skip connection jumps over a few layers, generally 2 or 3, and adds the block's input directly to its output. This ensures that the gradient does not drop to 0 at the earlier layers when the network is trained. To understand this better, take a look at the sketch of a residual block below:
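This is a rough PyTorch sketch of a basic residual block: the input skips over two convolution layers and is added back to their output. (Real ResNets also use a 1x1 projection on the shortcut whenever the number of channels or the spatial size changes, which is left out here.)

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: out = ReLU(F(x) + x), where F is two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection: gradients flow straight through "+ x"

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```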

VGG

Similar to AlexNet, VGG is a large-scale image classification network based on the traditional CNN architecture. VGG comes in several variants; VGG-16 and VGG-19, with 16 and 19 weight layers respectively, are the most popular ones. The difference between VGG and AlexNet is that VGG uses smaller kernels and strides than AlexNet and has a deeper architecture. Because of that deeper architecture, VGG also has more ReLU units, so its mapping function is more discriminative and performs better.
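As a rough illustration of the "smaller kernels, more depth" idea, here is a sketch of one VGG-style block in PyTorch: a stack of 3x3 convolutions followed by 2x2 max pooling. VGG-16 and VGG-19 repeat such blocks with growing channel counts (64, 128, 256, 512, 512) before three fully connected layers.

```python
import torch
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """A VGG-style block: `num_convs` 3x3 convolutions (stride 1) followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [
            nn.Conv2d(in_channels if i == 0 else out_channels,
                      out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves height and width
    return nn.Sequential(*layers)

block = vgg_block(3, 64, num_convs=2)            # the first block of VGG-16
print(block(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```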

GoogLeNet

GoogLeNet is a CNN architecture developed by Google researchers and built around the Inception module. The Inception module applies multiple filters side by side to the same input and combines their outputs into a single layer.

The advantage of the Inception module is that the exact location and scale of a feature within an image matters less, because filters of several sizes are applied in parallel and their outputs are all combined into one. This reduces the need for very deep networks, and with it the vanishing gradient problem, while preserving accuracy. Also, wider networks are easier to train than deeper ones.
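Here is a simplified Inception-style module in PyTorch, as a sketch: 1x1, 3x3 and 5x5 convolutions plus a pooling branch run in parallel, and their outputs are concatenated along the channel dimension. (The actual GoogLeNet module also places 1x1 bottleneck convolutions before the 3x3 and 5x5 branches to keep the computation cheap.)

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Parallel 1x1, 3x3, 5x5 and pooling branches, concatenated channel-wise."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, kernel_size=1),
        )

    def forward(self, x):
        # every branch preserves height and width, so the outputs can be concatenated
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionSketch(192, 64, 128, 32, 32)(x).shape)  # torch.Size([1, 256, 28, 28])
```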

https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-22_at_3.28.59_PM.png

On observing the architecture closely, you can see two auxiliary softmax classifiers in the middle part of the network. Their losses are added to the main loss during the training phase to counter vanishing gradients. They are meant to come into play especially towards the end of training, when the loss and accuracy start to saturate.
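In practice the auxiliary losses are simply added to the main loss with a small weight. The sketch below assumes the 0.3 weighting used in the GoogLeNet paper; the auxiliary heads are dropped at inference time.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    """Combine the main softmax loss with the two auxiliary classifier losses."""
    main_loss = criterion(main_logits, targets)
    aux_loss = criterion(aux1_logits, targets) + criterion(aux2_logits, targets)
    return main_loss + aux_weight * aux_loss

# toy example with random logits for a batch of 4 images and 1000 classes
targets = torch.randint(0, 1000, (4,))
logits = [torch.randn(4, 1000) for _ in range(3)]
print(googlenet_loss(*logits, targets))
```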

DenseNet

The networks we have seen so far send the output of one layer only to the next, in a sequential manner. DenseNet works quite differently. DenseNet is a densely connected neural network in which the output of each layer is passed on to every later layer, which implements feature reuse. This way, the individual layers can stay narrow, yet the network still learns features effectively because of its dense connectivity.

https://cloud.githubusercontent.com/assets/8370623/17981494/f838717a-6ad1-11e6-9391-f0906c80bc1d.jpg

DenseNet is mainly made up of two units: the Dense Block and the Transition Layer. A Dense Block stacks Batch Normalization, ReLU and Convolution layers with dense connections between them, while the Transition Layer reduces the complexity of the model by shrinking the number of feature maps with 1x1 convolutions and halving the height and width of the output with 2x2 average pooling of stride 2.
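A rough PyTorch sketch of these two units, assuming the basic (non-bottleneck) DenseNet layer; the published version also starts the transition layer with batch normalization, which is omitted here.

```python
import torch
import torch.nn as nn

def dense_layer(in_ch, growth_rate):
    # BN -> ReLU -> 3x3 Conv, producing `growth_rate` new feature maps
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1),
    )

class DenseBlockSketch(nn.Module):
    """Each layer's output is concatenated with all previous features (feature reuse)."""
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            dense_layer(in_ch + i * growth_rate, growth_rate) for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # append new features to everything so far
        return x

def transition(in_ch, out_ch):
    # 1x1 conv shrinks the number of feature maps, 2x2 average pooling halves H and W
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                         nn.AvgPool2d(kernel_size=2, stride=2))

x = torch.randn(1, 64, 56, 56)
block = DenseBlockSketch(64, growth_rate=32, num_layers=4)  # 64 + 4*32 = 192 channels out
print(transition(192, 96)(block(x)).shape)                  # torch.Size([1, 96, 28, 28])
```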

https://miro.medium.com/max/1400/1*qg5cCnke3684W1w5z32ddg.png

EfficientNet

EfficientNet takes on the question of how to scale CNNs. Experimental studies show that as image resolution increases, depth and width also need to increase to achieve good accuracy. But scaling any one dimension alone has only limited benefits, hence the concept of compound scaling. Compound scaling is a principle by which depth, width and resolution are scaled together in a balanced manner. A grid search over the parameters alpha, beta and gamma, which define how much depth, width and resolution should be scaled, concluded that per scaling step the depth should increase by about 20%, the width by about 10% and the resolution by about 15%. These values keep the scaling balanced and give efficient results.
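In code form, compound scaling is just a rule for choosing all three factors from a single compound coefficient. The sketch below uses the paper's alpha = 1.2, beta = 1.1 and gamma = 1.15 (the roughly 20%, 10% and 15% figures above); the baseline depth, width and resolution are hypothetical numbers chosen only for illustration.

```python
# Compound scaling: for a compound coefficient phi, scale
# depth by alpha**phi, width by beta**phi and resolution by gamma**phi.
# The grid-searched values satisfy alpha * beta**2 * gamma**2 ~= 2,
# so each step of phi roughly doubles the FLOPs.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)             # number of layers
    width = round(base_width * BETA ** phi)              # number of channels
    resolution = round(base_resolution * GAMMA ** phi)   # input image size
    return depth, width, resolution

# hypothetical baseline: 18 layers, 64 channels, 224x224 input
for phi in range(4):
    print(phi, compound_scale(18, 64, 224, phi))
```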

https://1.bp.blogspot.com/-Cdtb97FtgdA/XO3BHsB7oEI/AAAAAAAAEKE/bmtkonwgs8cmWyI5esVo8wJPnhPLQ5bGQCLcBGAs/s1600/image4.png

Thank you for reading!
