Taken together, these studies suggest that what matters for efficient and accurate vision models is the combination of the particular layer ingredients found in the Metaformer block (tokenization, independent spatial and channel processing, normalization, and residual connections) and the inductive biases typically found in CNNs (local processing with weight sharing and a hierarchical network structure). Notably, this conclusion does not imply a special role for MLPs: a Metaformer built on purely convolutional layers works (almost) just as well.
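To make the block structure concrete, here is a minimal sketch (in PyTorch, used here only for illustration) of a generic Metaformer-style block: normalization, a pluggable token (spatial) mixer, a channel MLP, and residual connections. The module names and the MLP token mixer below are illustrative stand-ins, not the exact implementations of any cited paper; the point is that the spatial mixer is swappable.

```python
# Minimal, illustrative Metaformer-style block (not the exact code of any cited paper).
import torch
import torch.nn as nn

class ChannelMLP(nn.Module):
    """Position-wise MLP applied independently to each token's channels."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        return self.net(x)

class TokenMLP(nn.Module):
    """MLP-Mixer-style spatial mixer: an MLP applied across the token dimension."""
    def __init__(self, num_tokens):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_tokens, num_tokens),
            nn.GELU(),
            nn.Linear(num_tokens, num_tokens),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Mix information across tokens by operating on the token axis.
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class MetaformerBlock(nn.Module):
    """Norm -> token mixer -> residual, then norm -> channel MLP -> residual."""
    def __init__(self, dim, token_mixer):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = token_mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = ChannelMLP(dim)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Example: 196 tokens (a 14x14 patch grid) with 384 channels.
block = MetaformerBlock(dim=384, token_mixer=TokenMLP(num_tokens=196))
tokens = torch.randn(2, 196, 384)
out = block(tokens)  # shape: (2, 196, 384)
```

In this abstraction, replacing `TokenMLP` with a depthwise convolution, a pooling layer, or self-attention yields the convolutional, pooling, and transformer variants discussed above, which is exactly why the specific choice of MLP as the token mixer appears to matter less than the overall block recipe.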
So are there other reasons for the recent focus on V-MLPs? The above-mentioned convolutional Metaformers were all tested on vision tasks, and it is well known that convolutional structure matches natural image statistics well. Indeed, as mentioned above, the best-performing V-MLPs and ViTs (re-)introduce inductive biases typically found in CNNs, such as local hierarchical processing. However, if one is interested in a generic model that performs well on multimodal tasks and has lower computational complexity than standard transformers, an MLP-based network can be a good choice. For example, initial results show that MLP-based Metaformers also perform well on NLP tasks [18, 29].
An additional benefit of isotropic MLP-based models is that they scale more easily: their regular, repeated compute pattern maps well onto compute infrastructure that favors such regularity, and scaling them up makes it easier to capture the high information content of large (multimodal) datasets.
Based on current findings, we can formulate the following practical guidelines: for settings that are significantly resource- and data-constrained, such as edge computing, there is currently little evidence that V-MLPs, like ViTs, are a superior alternative to CNNs. However, when datasets are large and/or multimodal, and compute is more abundant, pure MLP-based models may be a more efficient and generic choice compared to CNNs and transformer-based models that rely on self-attention.
We are still in the early days of examining the possibilities of MLP-based models. In just nine months, the accuracy of V-MLPs on ImageNet classification increased by a stunning ~8%. We expect these models to improve further, and hybrid networks that properly combine MLPs, CNNs, and attention mechanisms have the potential to significantly outperform existing models (e.g. [30]). We are excited to be part of this future.
Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!

References
[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25 (2012): 1097-1105.
[2] Karen Simonyan, Andrew Zisserman: “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014; [http://arxiv.org/abs/1409.1556 arXiv:1409.1556].
[3] Min Lin, Qiang Chen, Shuicheng Yan: “Network In Network,” 2013; [http://arxiv.org/abs/1312.4400 arXiv:1312.4400].
[4] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer: “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” 2016; [http://arxiv.org/abs/1602.07360 arXiv:1602.07360].
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: “Deep Residual Learning for Image Recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam: “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017; [http://arxiv.org/abs/1704.04861 arXiv:1704.04861].
[7] Mingxing Tan, Quoc V. Le: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” International Conference on Machine Learning, 2019; [http://arxiv.org/abs/1905.11946 arXiv:1905.11946].
[8] Ross Wightman, Hugo Touvron, Hervé Jégou: “ResNet strikes back: An improved training procedure in timm,” 2021; [http://arxiv.org/abs/2110.00476 arXiv:2110.00476].
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2018; [http://arxiv.org/abs/1810.04805 arXiv:1810.04805].
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” 2020; [http://arxiv.org/abs/2010.11929 arXiv:2010.11929].
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei: “ImageNet: A Large-Scale Hierarchical Image Database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[12] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou: “Training data-efficient image transformers & distillation through attention,” 2020; [http://arxiv.org/abs/2012.12877 arXiv:2012.12877].
[13] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick: “Early Convolutions Help Transformers See Better,” 2021; [http://arxiv.org/abs/2106.14881 arXiv:2106.14881].
[14] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” 2021; [http://arxiv.org/abs/2103.14030 arXiv:2103.14030].
[15] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao: “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,” 2021; [http://arxiv.org/abs/2102.12122 arXiv:2102.12122].
[16] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira: “Perceiver: General Perception with Iterative Attention,” 2021; [http://arxiv.org/abs/2103.03206 arXiv:2103.03206].
[17] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy: “MLP-Mixer: An all-MLP Architecture for Vision,” 2021; [http://arxiv.org/abs/2105.01601 arXiv:2105.01601].
[18] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou: “ResMLP: Feedforward networks for image classification with data-efficient training,” 2021; [http://arxiv.org/abs/2105.03404 arXiv:2105.03404].
[19] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, Tomas Pfister: “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding,” 2021; [http://arxiv.org/abs/2105.12723 arXiv:2105.12723].
[20] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang: “Hire-MLP: Vision MLP via Hierarchical Rearrangement,” 2021; [http://arxiv.org/abs/2108.13341 arXiv:2108.13341].
[21] Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li: “S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision,” 2021; [http://arxiv.org/abs/2108.01072 arXiv:2108.01072].
[22] Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi: “ConvMLP: Hierarchical Convolutional MLPs for Vision,” 2021; [http://arxiv.org/abs/2109.04454 arXiv:2109.04454].
[23] Yuki Tatsunami, Masato Taki: “RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?” 2021; [http://arxiv.org/abs/2108.04384 arXiv:2108.04384].
[24] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan: “MetaFormer is Actually What You Need for Vision,” 2021; [http://arxiv.org/abs/2111.11418 arXiv:2111.11418].
[25] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, Yunhe Wang: “An Image Patch is a Wave: Quantum Inspired Vision MLP,” 2021; [http://arxiv.org/abs/2111.12294 arXiv:2111.12294].
[26] Ziyu Wang, Wenhao Jiang, Yiming Zhu, Li Yuan, Yibing Song, Wei Liu: “DynaMixer: A Vision MLP Architecture with Dynamic Mixing,” 2022; [http://arxiv.org/abs/2201.12083 arXiv:2201.12083].
[27] Asher Trockman, J. Zico Kolter: “Patches Are All You Need?” 2022; [http://arxiv.org/abs/2201.09792 arXiv:2201.09792].
[28] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie: “A ConvNet for the 2020s,” 2022; [http://arxiv.org/abs/2201.03545 arXiv:2201.03545].
[29] Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le: “Pay Attention to MLPs,” 2021; [http://arxiv.org/abs/2105.08050 arXiv:2105.08050].
[30] Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs,” 2022; [http://arxiv.org/abs/2202.06510 arXiv:2202.06510].