2022-01-20

Transformers in Computer Vision

Bert Moons | Director – System Architecture at AXELERA AI

Summary

Convolutional Neural Networks (CNNs) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViTs) with a higher learning capacity. The fastest ViTs are essentially CNN/Transformer hybrids that combine the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been established. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs on ImageNet validation, especially (1) when trained from scratch without distillation, (2) in the lower-accuracy (<80%) regime, and (3) for lower network complexities optimized for Edge devices.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been the dominant Neural Network architectures in Computer Vision for almost a decade, following the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] and RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] and EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications outside of image classification, such as object detection, panoptic vision, semantic segmentation, or other specialized tasks. This is done by using them as a backbone in an end-to-end, application-specific Neural Network and finetuning the resulting network on the appropriate data set and application.

A typical ResNet-style CNN is shown in Figure 1-1 and Figure 1-4 (a). Such networks typically share several features:

They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions against building a large receptive field.
Training is stabilized by using batch normalization and residual connections.
Feature maps are built hierarchically by gradually reducing the spatial dimensions (W, H), ultimately downscaling them by a factor of 32.
Feature maps are built pyramidally by increasing the embedding dimensions of the layers, from on the order of 10 channels in the first layers to thousands in the last.
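
The hierarchical and pyramidal structure listed above can be traced with a few lines of Python. This is a minimal sketch, not the exact configuration of Figure 1-1: the helper name `feature_map_shapes`, the stem stride of 4, and the stage widths and strides (the commonly used ResNet-34 layout) are assumptions made for illustration.

```python
def feature_map_shapes(height, width, stem_stride=4,
                       stage_channels=(64, 128, 256, 512),
                       stage_strides=(1, 2, 2, 2)):
    """Return a list of (channels, height, width) after each stage.

    Assumed layout: a stem that downsamples by `stem_stride`, followed by
    stages that widen the channels while shrinking the spatial size.
    """
    h, w = height // stem_stride, width // stem_stride
    shapes = []
    for channels, stride in zip(stage_channels, stage_strides):
        h, w = h // stride, w // stride
        shapes.append((channels, h, w))
    return shapes


# Trace a 224x224 input through the assumed four-stage backbone.
shapes = feature_map_shapes(224, 224)
for channels, h, w in shapes:
    print(f"{channels:4d} channels, {h}x{w} spatial")

# Overall spatial downscaling factor between input and last stage.
total_downscale = 224 // shapes[-1][1]
```

Under these assumptions the spatial resolution falls from 56×56 to 7×7 while the channel count rises from 64 to 512, which matches the factor-32 downscaling described above.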

Figure 1-1: Illustration of ResNet34 [3]

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline[8].

Transformers in Computer Vision

A more radical evolution in Neural Networks for Computer Vision is the move towards using Vision Transformers (ViT)[9] as a replacement for CNN backbones. Inspired by the astounding performance of Transformer models in Natural Language Processing (NLP)[10], research has moved towards applying the same principles in Computer Vision. Notable examples, among many others, are XCiT[11], PiT[12], DeiT[13] and Swin Transformers[14]. Here, analogously to NLP, images are essentially treated as sequences of image patches: feature maps are modeled as vectors of tokens, each token representing an embedding of a specific image patch.
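
To make the "image as a sequence of patches" idea concrete, the sketch below splits a small image into flattened patch tokens. It is only an illustration: the function name `patchify` and the nested-list image layout are assumptions, and a real ViT would additionally pass each flattened patch through a learned linear projection and add position information.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Tokens are returned in row-major patch order; each token concatenates
    the channel values of all pixels inside one patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # all channels of this pixel
            tokens.append(token)
    return tokens


# Toy 4x4 single-channel image whose pixel value equals its row-major index.
image = [[[4 * y + x] for x in range(4)] for y in range(4)]
tokens = patchify(image, patch=2)

# Token count for a 224x224 image with 16x16 patches (the ViT-S/16 setting).
num_tokens_224 = (224 // 16) ** 2
```

With 16×16 patches, a 224×224 image becomes a sequence of 196 tokens, which is the sequence length the self-attention layers then operate on.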

An illustration of a basic ViT is given in Figure 1-2. The ViT is a sequence of stacked MLPs and self-attention layers, with or without residual connections. This ViT uses the multi-headed self-attention mechanism developed for NLP Transformers, see Figure 1-3. Such a self-attention layer has two distinguishing features. It (1) can dynamically ‘guide’ its attention by reweighting the importance of specific features depending on the context, and (2) has a full receptive field when global self-attention is used. The latter is the case when self-attention is applied across all input tokens. Here all tokens, each representing an embedding of a specific spatial image patch, are correlated with each other, giving a full receptive field. Global self-attention is typical in ViTs, but not a requirement. Self-attention can also be made local by limiting the scope of the self-attention module to a smaller set of tokens, which in turn reduces the operation’s receptive field at a particular stage.
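
The difference between global and local self-attention can be sketched in plain Python. This is a deliberately simplified, single-head version with identity query/key/value projections and a 1-D window; real ViTs use learned projections, multiple heads and 2-D windows. The names `self_attention` and `window` are assumptions for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, window=None):
    """Single-head self-attention with identity Q/K/V (a simplification).

    tokens: list of N vectors of equal length D.
    window: None for global attention, or a radius so that token i only
            attends to tokens j with |i - j| <= window (local attention).
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    n, dim = len(tokens), len(tokens[0])
    outputs, weights = [], []
    for i, query in enumerate(tokens):
        visible = [j for j in range(n)
                   if window is None or abs(i - j) <= window]
        # Scaled dot-product scores: these depend on the input itself,
        # which is what makes the weighting dynamic.
        scores = [sum(q * k for q, k in zip(query, tokens[j])) / math.sqrt(dim)
                  for j in visible]
        row = [0.0] * n
        for j, p in zip(visible, softmax(scores)):
            row[j] = p
        weights.append(row)
        outputs.append([sum(row[j] * tokens[j][d] for j in range(n))
                        for d in range(dim)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
_, global_weights = self_attention(tokens)             # full receptive field
_, local_weights = self_attention(tokens, window=1)    # neighbours only
```

In the global case every token receives a non-zero weight from every other token, while in the local case the weights outside the window are exactly zero, which is the reduced receptive field described above.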

This ViT architecture contrasts strongly with CNNs. In vanilla CNNs without attention mechanisms, (1) features are statically weighted using pretrained weights, rather than dynamically reweighted based on the context as in ViTs, and (2) the receptive fields of individual network layers are typically local and limited by the convolutional kernel size.

Figure 1-4: Comparing the dimension configurations of (a) ResNet-50, a classical CNN with pyramidal feature maps, (b) an early ViT-S/16 [9] with a uniform macro-architecture and (c) a modern PiT-S [12] with CNN-ified pyramidal feature maps. Figure taken from [12].

Part of the success of CNNs is the strong architectural inductive bias implied by the convolutional approach. Convolutions with shared weights explicitly encode how specific identical patterns are repeated in images. This inductive bias ensures easy training convergence on relatively small datasets, but also limits the modeling capacity of CNNs. Vision Transformers do not enforce such strict inductive biases. This makes them harder to train, but also increases their learning capacity, see Figure 1-5. To achieve good results using ViTs in Computer Vision, these networks are often trained using knowledge distillation with a large CNN-based teacher (as in DeiT[16], for example). This way, part of the inductive bias of CNNs can be imposed on the training process in a softer form.
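
As a rough illustration of how a teacher can guide a student during training, the sketch below implements a generic soft-label distillation loss: a weighted sum of the usual cross-entropy on the true label and a KL divergence between temperature-softened teacher and student distributions. This is a textbook formulation used here as an assumption; it is not claimed to be the exact recipe of DeiT, and the names `distillation_loss`, `alpha` and `temperature` are chosen for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """(1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student on the true class.
    cross_entropy = -log_softmax(student_logits)[label]
    # Soft-label term: KL divergence between softened distributions.
    student_log = log_softmax(student_logits, temperature)
    teacher_log = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_log, student_log))
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


teacher = [4.0, 1.0, 0.0]   # hypothetical CNN teacher logits
student = [1.0, 1.5, 0.0]   # hypothetical ViT student logits
loss = distillation_loss(student, teacher, label=0)

# With alpha=1 only the teacher-matching term remains.
matching_only = distillation_loss(student, teacher, label=0, alpha=1.0)
perfect_match = distillation_loss(teacher, teacher, label=0, alpha=1.0)
```

The soft term vanishes when the student reproduces the teacher's distribution, so minimizing it pulls the student towards the teacher's behaviour, including whatever inductive bias the teacher has learned.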

Initially, ViTs were directly inspired by NLP Transformers: massive models with a uniform topology and global self-attention, see Figure 1-4 (b). Recent ViTs have a macro-architecture that is closer to that of CNNs (Figure 1-4 (a)), using hierarchical pyramidal feature maps (as in PiT [12]; see Figure 1-4 (c)) and local self-attention (as in Swin Transformers [14]). A high-level overview of this evolution is given in Table 1.
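
The cost benefit of local self-attention can be estimated with the complexity expressions given in the Swin Transformer paper [14]: global attention grows quadratically with the number of tokens, while windowed attention grows linearly for a fixed window size. The sketch below evaluates both; the function names and the example sizes (a 56×56 token map, 96 channels, 7×7 windows) are assumptions chosen for illustration, and softmax cost is ignored.

```python
def global_attention_cost(num_tokens, dim):
    """Approximate cost of one global multi-head self-attention layer.

    4*N*D^2 covers the Q, K, V and output projections;
    2*N^2*D covers the attention scores and the weighted sum of values.
    """
    return 4 * num_tokens * dim ** 2 + 2 * num_tokens ** 2 * dim


def window_attention_cost(num_tokens, dim, window_tokens):
    """Approximate cost when each token attends only within its window.

    `window_tokens` is the number of tokens per window (e.g. 7*7 = 49).
    """
    return 4 * num_tokens * dim ** 2 + 2 * window_tokens * num_tokens * dim


# Assumed example: 56x56 tokens, 96 channels, 7x7 local windows.
num_tokens, dim, window_tokens = 56 * 56, 96, 7 * 7
global_cost = global_attention_cost(num_tokens, dim)
local_cost = window_attention_cost(num_tokens, dim, window_tokens)
print(f"global / local cost ratio: {global_cost / local_cost:.1f}")
```

When the window covers the whole token map the two expressions coincide, so windowed attention can be read as a strict generalization that trades receptive field for cost.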

Table 1: Comparing early ViTs, recent ViTs and modern CNNs

Figure 1-5: Comparing CNNs to ViTs in terms of model size (# Params) and ImageNet Top-1 Validation accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation with a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without CNN-based knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates the lasting competitiveness of CNNs with ViTs, especially in the Edge domain for models with fewer than 25M parameters, where performance is very similar between CNNs and ViTs. ResNet-50 and EfficientNet-B0 are given as reference points. Data is taken from [18] and the respective scientific papers.

Comparing CNNs and ViTs for Edge Computing

Even though ViTs have shown State-of-the-Art (SotA) performance in many Computer Vision tasks, they do not necessarily outperform CNNs across the board. This is illustrated in Figure 1-5 and Figure 1-6. These figures compare the performance of ViTs and CNNs in terms of ImageNet validation accuracy versus model size and complexity, for various training regimes. It’s important to distinguish between these training regimes, as not all training methodologies are feasible for specific downstream tasks. First, for some applications there are only relatively small datasets available. In that case, CNNs typically perform better. Second, many ViTs rely on distillation approaches to achieve high performance. For that to work, they need a highly-accurate pretrained CNN as a teacher, which is not always available.

Figure 1-5 (a) illustrates how CNNs and ViTs compare in terms of model size versus accuracy if all types of training are allowed, including distillation approaches and the use of additional data (such as JFT-300M [17]). Here ViTs perform on par with or better than large-scale CNNs, outperforming them in specific ranges. Notably, XCiT [11] models perform particularly well in the range around 3M parameters. However, when neither distillation nor training on extra data is allowed, the difference is less pronounced, see Figure 1-5 (b). In both figures, EfficientNet-B0 and ResNet-50 are indicated as references for context.

Figure 1-6 illustrates the same comparison in terms of accuracy versus model complexity for a more limited set of known networks. Figure 1-6 (a) and (b) show that CNNs are mostly dominant for lower accuracies and for networks with lower complexity (<1B FLOPS) under all types of training. This holds even for CNN-ified Vision Transformers such as PiT [12], which use a hierarchical architecture with pyramidal feature maps, and for Swin Transformers, which reduce complexity by using local self-attention. Without extra data or distillation, CNNs typically outperform ViTs across the board, especially for networks with a lower complexity or with accuracies lower than 80%. For example, at a similar complexity, both RegNets and EfficientNet-style networks significantly outperform XCiT ViTs, see Figure 1-6 (b).

Figure 1-6: Comparing SotA CNNs to ViTs in terms of computational cost (# FLOPS) and ImageNet Top-1 Validation Accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation with a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without extra data or knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates how CNNs are still dominant in the <80% accuracy regime. Even CNN-ified modern ViTs with hierarchical pyramidal models such as PiT [12] do not outperform EfficientNet [6] and RegNet [4] style CNNs. In the 80%+ range, networks with local self-attention such as Swin [14] are on par with or better than RegNets [4]. Data is taken from [16] and the respective scientific papers.

Apart from the high-level differences in Table 1 and the performance differences in this section, there are some other key requirements that differ when bringing ViTs to edge devices. Compared to CNNs, ViTs rely much more on three specific operations that must be properly accelerated on-chip. First, ViTs rely on accelerated softmax operators as part of self-attention, while CNNs only require softmax as the final layer in a classification network. Second, ViTs typically use smooth nonlinear activation functions, while CNNs mostly rely on Rectified Linear Units (ReLU), which are much cheaper to execute and accelerate. Finally, ViTs typically require LayerNorm, a form of layer normalization that computes the mean and standard deviation dynamically to stabilize training. CNNs, however, typically use batch normalization, which only needs to be computed during training and can essentially be ignored at inference by folding the operation into neighbouring convolutional layers.
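
The contrast between the two normalization schemes can be shown with a small sketch. For a single output channel, batch normalization at inference uses fixed statistics and is an affine function of the layer output, so it can be merged into the preceding weights and bias. LayerNorm statistics depend on each input and cannot be precomputed in the same way. The function names and numeric values below are assumptions for illustration, and the "convolution" is reduced to a dot product for one output position.

```python
import math


def dot(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


def batch_norm(value, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm for one channel with fixed statistics."""
    return gamma * (value - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch norm into the preceding layer's weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta


def layer_norm(values, eps=1e-5):
    """LayerNorm without learned scale/shift: statistics come from the input."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Hypothetical layer parameters and running statistics for one channel.
weights, bias = [0.5, -1.0, 2.0], 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
inputs = [1.0, 2.0, 3.0]

# Two-step inference: layer output followed by batch norm.
unfused = batch_norm(dot(weights, inputs) + bias, gamma, beta, mean, var)

# One-step inference with folded parameters: no normalization op remains.
folded_weights, folded_bias = fold_batch_norm(weights, bias, gamma, beta, mean, var)
fused = dot(folded_weights, inputs) + folded_bias

# LayerNorm must recompute mean and variance for every new input.
normalized = layer_norm(inputs)
```

After folding, the batch-norm operation disappears from the inference graph entirely, whereas the mean and variance in `layer_norm` have to be evaluated at run time for every token.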

Conclusion

Vision Transformers are rapidly starting to dominate many applications in Computer Vision. Compared to CNNs, they achieve higher accuracies on large data sets due to their higher modeling capacity, weaker inductive biases and global receptive fields. Modern, improved and smaller ViTs such as PiT and Swin are essentially becoming CNN-ified by reducing receptive fields and using hierarchical pyramidal feature maps. However, CNNs are still on par with or better than SotA ViTs on ImageNet in terms of model complexity or size versus accuracy, especially when trained without knowledge distillation or extra data and when targeting lower accuracies.


References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012): 1097-1105.

[2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

[3] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[4] Radosavovic, Ilija, et al. “Designing network design spaces.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[6] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International Conference on Machine Learning. PMLR, 2019.

[7] He, Xin, Kaiyong Zhao, and Xiaowen Chu. “AutoML: A Survey of the State-of-the-Art.” Knowledge-Based Systems 212 (2021): 106622.

[8] Moons, Bert, et al. “Distilling optimal neural networks: Rapid search in diverse spaces.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[9] Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[10] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[11] El-Nouby, Alaaeldin, et al. “XCiT: Cross-Covariance Image Transformers.” arXiv preprint arXiv:2106.09681 (2021).

[12] Heo, Byeongho, et al. “Rethinking spatial dimensions of vision transformers.” arXiv preprint arXiv:2103.16302 (2021).

[13] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[14] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” arXiv preprint arXiv:2103.14030 (2021).

[15] Li, Yawei, et al. “Spatio-Temporal Gated Transformers for Efficient Video Processing.” NeurIPS ML4AD Workshop, 2021.

[16] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[17] Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.

[18] Ross Wightman, “Pytorch Image Models”, https://github.com/rwightman/pytorch-image-models, seen on January 10, 2022
