视觉基础模型研究现状与发展趋势
Research status and development trends of vision foundation models
2025, Vol. 30, No. 1: 1-24
Print publication date: 2025-01-16
DOI: 10.11834/jig.230911
张燚钧, 张润清, 周华健, 齐骥, 余肇飞, 黄铁军. 视觉基础模型研究现状与发展趋势[J]. 中国图象图形学报, 2025, 30(1): 1-24.
Zhang Yijun, Zhang Runqing, Zhou Huajian, Qi Ji, Yu Zhaofei, Huang Tiejun. Research status and development trends of vision foundation models[J]. Journal of Image and Graphics, 2025, 30(1): 1-24.
在计算机视觉领域,尽管传统的深度学习视觉模型在特定任务上表现出色,但它们对大量标注数据的高度依赖及在新场景下性能泛化的局限性,大大增加了使用成本并限制了模型的应用范围。近年来,以Transformer为核心的新型模型结构,特别是在自监督学习领域的应用,为解决这些挑战提供了新的解决方案。这些模型通常通过大规模数据预训练,在处理复杂视觉场景中展现出强大的泛化能力,其被广泛称为视觉基础模型。本文深入探讨了视觉基础模型的研究现状与未来发展趋势,并重点关注该领域的关键技术进展及其对未来计算机视觉的潜在影响。首先回顾和梳理了视觉基础模型的背景与发展历程,然后介绍了在这一发展历程中出现的关键模型基础结构,介绍并分析了构建视觉基础模型所采用的各类预训练任务的设计思路,并根据其特性对现有的视觉基础模型进行分类。同时,对不同类型视觉基础模型中的代表性工作进行了介绍,并整理了目前可用于视觉基础模型预训练的数据集。最后,对视觉基础模型的研究现状进行总结和思考,提出了目前存在的一些挑战,并展望未来可能的研究方向。
In the field of computer vision, traditional deep learning vision models have exhibited remarkable performance on specific tasks. However, their substantial dependency on large amounts of annotated data and limited capability in generalization across new scenes significantly elevate usage costs and restrict the application scope of these models. Recently, novel model architectures centered around the Transformer, particularly in the domain of self-supervised learning, have emerged as solutions to these challenges. These models, typically pre-trained on extensive datasets, demonstrate robust generalization capabilities in complex visual scenarios and are widely recognized as vision foundation models. This paper delves into the current research status and future trends of vision foundation models, with a focus on key technological advancements in this field and their potential impact on future developments in computer vision. The paper begins by reviewing and organizing the background and developmental history of vision foundation models, followed by an introduction to the key model structures that have emerged in this developmental trajectory. The article further introduces and analyzes the design philosophies of various pre-training tasks employed in constructing vision foundation models, categorizing the existing models based on their characteristics. Additionally, the paper presents representative works in different types of vision foundation models and compiles the currently available datasets for pre-training these models. Finally, the paper summarizes the current research status of vision foundation models, reflects on existing challenges, and anticipates potential future research directions.
This paper offers an expansive examination of the landscape of vision foundation models, chronicling their evolution and current achievements and charting a course for future research. It acknowledges the transformative impact of deep learning on computer vision, which shifted the paradigm from traditional computational methods to models that excel in specialized tasks. However, it also confronts the limitations of these models, particularly their narrow applicability and reliance on extensive, meticulously annotated datasets, which have elevated deployment costs and restricted versatility. In response, the emergence of Transformer-based architectures has instigated a paradigm shift, leading to the development of vision foundation models that are redefining the capabilities and breadth of applicability of computer vision systems. This paper provides a systematic review of these models, offering historical context that underscores the transition from traditional deep learning models to the current paradigm involving Transformer-based models. It delves into the core structures of these models, such as the Transformer and vision Transformer, discussing their architectural intricacies and the principles that underpin their design, enabling them to capture the complexities of visual data with nuance and accuracy. A pivotal contribution of this paper is the thorough analysis of pre-training tasks, which is foundational to the construction of robust vision foundation models. It categorizes these tasks on the basis of their design philosophies and their effectiveness in enabling models to learn rich feature representations from large-scale datasets, thereby enhancing their generalization capabilities across a myriad of computer vision tasks.
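To make the patch-based design concrete, the following minimal PyTorch sketch illustrates the vision Transformer idea discussed above: an image is cut into fixed-size patches, each patch is linearly embedded, and the resulting sequence is processed by a standard Transformer encoder. The class name, dimensions, and hyperparameters are illustrative assumptions, not the configuration of any model surveyed in the paper.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative vision Transformer: patch embedding + Transformer encoder."""
    def __init__(self, image_size=224, patch_size=16, dim=384, depth=6, heads=6, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch projection implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                        # images: (B, 3, 224, 224)
        x = self.patch_embed(images)                  # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)              # (B, 196, dim) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                           # global self-attention over all patches
        return self.head(x[:, 0])                     # predict from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))       # -> shape (2, 1000)
```

The strided convolution is simply a convenient way to implement the non-overlapping patch projection, and the learned position embeddings restore the spatial order that self-attention alone would ignore.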
This article primarily introduces various pre-training methods such as supervised learning, image contrastive learning, image-text contrastive learning, and masked image modeling. It analyzes the characteristics of these pre-training tasks, the representative works corresponding to each, as well as their applicable scenarios and potential directions for improvement. This paper then introduces existing vision foundation models according to four categories: universal visual representation backbones, models aligned with the language modality, generative multi-task models, and models based on mixture of experts. Specifically, the article first analyzes the background of each type of foundation model, the core ideas of the models, and the pioneering works. Then, it analyzes the representative works in the development of each type of foundation model. Finally, it provides a prospect based on the strengths and weaknesses of each method. The paper also evaluates existing vision foundation models, scrutinizing their characteristics, capabilities, and the datasets utilized for their pre-training, providing an in-depth analysis of their performance on a variety of tasks, including image classification, object detection, and semantic segmentation.
This paper delves into the critical component of pre-training datasets that serve as the foundational resources for the development and refinement of vision foundation models. It presents a comprehensive overview of the extensive image datasets and the burgeoning realm of image-text datasets that are instrumental in the pre-training phase of these models. The discussion commences with the seminal ImageNet dataset, which has been crucial in computer vision research and serves as a benchmark for evaluating model performance. The paper then outlines the expansive ImageNet-21k and the colossal JFT-300M/3B datasets, highlighting their scale and the implications of such magnitude for model training and generalization capabilities. The COCO and ADE20K datasets are examined for their role in tasks such as object detection and semantic segmentation, underlining their contribution to the diversity and complexity of pre-training data. The Objects365 dataset is also acknowledged for its focus on open-world object detection, offering a rich resource for model exposure to a wide array of visual categories. In addition to image datasets, the paper underscores the importance of image-text datasets such as Conceptual Captions and Flickr30k, which are becoming increasingly vital for models that integrate multimodal inputs. These datasets provide the necessary linkage between visual content and textual descriptions, enabling models to develop a deeper understanding of the semantic context.
The paper anticipates research directions such as establishing comprehensive benchmark evaluation systems, enhancing cross-modal capabilities, expanding the coverage of various visual tasks, leveraging structured knowledge bases for training, and developing model compression and optimization techniques to facilitate the deployment of these models in real-world scenarios. The ultimate goal is to develop vision foundation models that are more versatile and intelligent, capable of addressing complex visual problems in the real world. Reflecting on the challenges faced by the field, the paper identifies the pressing need for more efficient training algorithms, the development of better evaluation metrics, and the integration of multimodal data.
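As a concrete example of one of these objectives, the short PyTorch sketch below shows the symmetric image-text contrastive loss used by CLIP-style pre-training, assuming two encoders have already produced paired image and text embeddings; the function name, embedding size, and temperature value are illustrative assumptions rather than settings from any specific work.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, D) embeddings of N matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)          # unit vectors so dot products are cosine similarities
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (N, N) pairwise similarity matrix
    targets = torch.arange(image_emb.size(0))            # the i-th image matches the i-th text
    loss_i2t = F.cross_entropy(logits, targets)          # pick the right text for each image
    loss_t2i = F.cross_entropy(logits.t(), targets)      # pick the right image for each text
    return (loss_i2t + loss_t2i) / 2

loss = image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Masked image modeling, by contrast, drops the text branch and instead reconstructs masked image patches (as raw pixels, features, or discrete visual tokens) from the visible ones.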
This paper also highlights several challenges, including the need for more unified model paradigms in computer vision, the development of effective performance improvement paths similar to the “scaling law” observed in NLP, and the necessity for new evaluation metrics that can assess models’ cross-modal understanding and performance across a wide range of tasks. It anticipates future research directions, such as the modularization of vision foundation models to enhance their adaptability and the exploration of weakly supervised learning techniques, which aim to diminish reliance on large annotated datasets. One of the key contributions of this paper is the in-depth discussion of the real-world applications of vision foundation models. It explores their implications for tasks such as medical image analysis, autonomous driving, and surveillance, underscoring their transformative potential and the profound impact they could have on these domains.
In conclusion, this paper synthesizes the current state of research on vision foundation models and outlines opportunities for future advancements. It emphasizes the importance of continued interdisciplinary research to unlock the full potential of vision foundation models and address the intricate challenges in the field of computer vision. The paper’s advocacy for an interdisciplinary approach is underpinned by the belief that it will foster innovation and enable the development of models that are not only more efficient and adaptable but also capable of addressing complex and multifaceted problems in the real world.
基础模型；计算机视觉(CV)；预训练模型；自监督学习；多任务学习
foundation model; computer vision (CV); pre-training model; self-supervised learning; multi-task learning