Survey on Transformer for image classification
2023, Vol. 28, No. 9, Pages 2661-2692
Print publication date: 2023-09-16
DOI: 10.11834/jig.220799
Shi Zhenghao, Li Chengjian, Zhou Liang, Zhang Zhijun, Wu Chenwei, You Zhenzhen, Ren Wenqi. 2023. Survey on Transformer for image classification. Journal of Image and Graphics, 28(09):2661-2692
Image classification underpins image understanding and plays an important role in practical computer vision applications. However, owing to the diversity of object shapes and types and the complexity of imaging environments, many image classification methods still deliver unsatisfactory results in practice, such as low classification accuracy and high false-positive rates, which seriously limits their use in downstream image and computer vision tasks. Improving the precision and accuracy of image classification through better algorithms is therefore an important research problem that has attracted growing attention. With the rapid development of deep learning and its wide, highly successful application to image processing, research on deep learning-based image classification has made great progress. To survey existing methods comprehensively and keep pace with the latest advances, this paper systematically reviews Transformer-driven deep learning methods and models for image classification. Unlike existing surveys on similar topics, it focuses on methods and models driven by Transformer variants, including Transformer-based image classification methods with scalable position encoding, methods with low complexity and low computational cost, methods fusing local and global information, and methods based on deep ViT (visual Transformer) models, analyzing existing approaches in depth from multiple perspectives such as design rationale, structural characteristics, and open problems. To compare different methods, their classification performance is experimentally evaluated on public image classification datasets such as ImageNet, CIFAR-10 (Canadian Institute for Advanced Research), and CIFAR-100, using metrics including accuracy, parameter count, floating point operations (FLOPs), overall accuracy (OA), average accuracy (AA), and the Kappa (κ) coefficient. Finally, future research directions are discussed.
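To make the tokenization step shared by the ViT-based methods above concrete, the following is a minimal NumPy sketch, not the implementation of any specific model surveyed here: an image is split into fixed-size patches, each flattened patch is linearly projected into a token, and an absolute position embedding is added. All weights are random placeholders standing in for learned parameters, and the function name patch_embed is illustrative.

import numpy as np

def patch_embed(image, patch_size=16, embed_dim=768, seed=0):
    # Split an (H, W, C) image into non-overlapping patches and
    # project each flattened patch to an embed_dim-dimensional token.
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid_h * grid_w, -1)
    # Random stand-ins for the learned projection and position embeddings
    w_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    pos_embed = rng.standard_normal((grid_h * grid_w, embed_dim)) * 0.02
    return patches @ w_proj + pos_embed

tokens = patch_embed(np.zeros((224, 224, 3)))  # -> (196, 768) token sequence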
Image classification is an important research direction in image processing and computer vision. It aims to identify the specific category of the object in an image and has considerable practical value. However, because of the diversity of object shapes and types and the complexity of imaging environments, the classification results of existing methods are often unsatisfactory; problems such as low classification accuracy and high false positives seriously limit the use of image classification in downstream image and computer vision tasks. Improving image classification accuracy through better algorithms is therefore highly desirable. Given the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial networks, to natural image object detection, applying deep learning to image classification has received great attention and become a research hotspot in image processing and computer vision in recent years, and many excellent works have emerged. As a rising star, the visual Transformer (ViT) has attracted increasing interest in image processing tasks, particularly because of its strong long-range modeling ability and parallel sequence processing. Several technical reviews of the Transformer have recently been published: ViT and its variants have been systematically summarized from different angles, and the application of the Transformer to different visual tasks has been introduced, which helps researchers study and track the progress of image classification technology. Compared with a traditional convolutional neural network (CNN), ViT achieves global modeling and parallel processing of an image by dividing the input into patches, which greatly improves the classification ability of the model. However, because of the complexity of image classification and the rapid evolution of ViT techniques, many problems remain, such as poor scalability, high computational overhead, slow convergence, and attention collapse. ViT variants have been proposed to address these problems in image processing tasks, yet few reviews help scholars comprehensively grasp, from a global perspective, the latest progress of ViT for image classification. Therefore, the present study systematically compares and summarizes ViT algorithms for image classification, building on the latest reviews and related research. Unlike existing review papers, our work focuses on domestic and international methods published in the past two years (January 2021 to December 2022). We begin by describing the basic concept, principle, and structure of the traditional Transformer model: first the attention and multi-head attention mechanisms, then the feed-forward network and position encoding, and finally the overall model structure. We then outline the evolution of the Transformer model and its applications to image processing in recent years, and briefly introduce the concept, principle, and structure of ViT.
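As a companion to the description of the attention and multi-head attention mechanisms above, here is a minimal NumPy sketch of multi-head self-attention over a token sequence. It illustrates the standard formulation, softmax(QK^T / sqrt(d)) V computed per head, rather than any surveyed model's implementation; the projection matrices are random stand-ins for learned weights.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads=8, seed=0):
    # x: (seq_len, d_model) token sequence -> (seq_len, d_model) output.
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Random stand-ins for the learned Q/K/V and output projections
    w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.02
                          for _ in range(4))
    def heads(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = heads(x @ w_q), heads(x @ w_k), heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    out = softmax(scores) @ v                            # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads
    return out @ w_o

attended = multi_head_self_attention(np.zeros((196, 768)))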
Various vision Transformer models and their applications to image classification are then described in detail according to the problems faced by ViT. Different solutions, including scalable position encoding, low complexity and low computational cost, local and global information fusion, and deep ViT models, are described one by one. Experiments on ImageNet, CIFAR-10 (Canadian Institute for Advanced Research), and CIFAR-100 are reported to demonstrate the classification performance of ViT and its variants. Two indicators, accuracy and parameter count, are adopted to evaluate the experimental results, and floating point operations (FLOPs) are also used to analyze model efficiency comprehensively. Given that the Transformer has also been widely used in remote sensing image classification in recent years, the present study further compares and analyzes Transformer-based remote sensing image classification methods. Experiments on the Indian Pines, Trento, and Salinas hyperspectral image datasets evaluate the Transformer for remote sensing image classification using three indicators: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. Finally, the problems and challenges faced by current applications of ViT in image classification are presented, and future research and development trends are discussed.
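For reference, the three remote sensing indicators named above can all be computed from a single confusion matrix. The sketch below is a straightforward NumPy implementation under the usual definitions (rows as true classes, columns as predictions); the helper name classification_metrics is illustrative, not from any surveyed work.

import numpy as np

def classification_metrics(conf):
    # conf: (K, K) confusion matrix, rows = true class, cols = prediction.
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                       # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # per-class accuracy
    aa = per_class.mean()                         # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                  # Cohen's kappa
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([[50, 2, 3],
                                        [5, 40, 5],
                                        [2, 3, 45]])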
Keywords: Transformer; self-attention mechanism; deep learning; image classification; scalable position encoding