Cross-modal representation learning and generation

Liu Huafeng; Chen Jingjing; Li Liang; Bao Bingkun; Li Zechao; Liu Jiaying; Nie Liqiang

doi:10.11834/jig.230035

Intelligent Interaction and Cross-modal Learning

2023, Vol. 28, No. 6, pp. 1608-1629
Print publication date: 2023-06-16

Liu Huafeng, Chen Jingjing, Li Liang, Bao Bingkun, Li Zechao, Liu Jiaying, Nie Liqiang. 2023. Cross-modal representation learning and generation. Journal of Image and Graphics, 28(06): 1608-1629. DOI: 10.11834/jig.230035.

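Summary

Multimedia data continues to grow explosively and is increasingly heterogeneous in source and structure, so cross-modal learning has drawn growing attention from academia and industry. Cross-modal representation and cross-modal generation are the two core problems of cross-modal learning. Cross-modal representation aims to exploit the complementarity among modalities and remove the redundancy between them, so as to obtain more effective feature representations. Cross-modal generation relies on the semantic consistency between modalities to convert data from one modality into the form of another, which helps improve transferability across modalities. This paper systematically analyzes important recent progress, both international and in China, in cross-modal representation and generation, covering traditional cross-modal representation learning, representation learning with large multimodal models, image-to-text cross-modal conversion, and cross-modal image generation. For traditional cross-modal representation learning, it discusses joint (unified) representation and coordinated representation. For large multimodal models, it discusses Transformer-based models. For image-to-text conversion, it discusses developments in the semantic description of images and videos, semantic analysis of video captions, and visual question answering. For cross-modal image generation, it covers joint representation methods for information from different modalities, cross-modal image generation techniques, and domain-specific image generation based on pre-training. The paper reviews the challenges in each of these subfields in detail, compares progress in China and abroad, and traces the lines of development and the current research frontiers. Finally, based on this analysis, it outlines development trends and possible breakthroughs for cross-modal representation and generation.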

Abstract

With the rapid growth of multimedia data, its multi-source and multi-modal nature has become a challenging problem in multimedia research. Representation and generation are two key problems in cross-modal learning. Cross-modal representation studies feature learning and information integration over multi-modal data; to obtain more effective feature representations, the complementarity between modalities needs to be exploited. Cross-modal generation focuses on the knowledge transfer mechanism across modalities: the semantic consistency between modalities is used to convert data from one modality into the form of another, which helps improve transferability between modalities. This paper critically reviews the literature on cross-modal representation and generation in four areas: 1) traditional cross-modal representation learning, 2) large models for cross-modal representation learning, 3) image-to-text cross-modal conversion, and 4) cross-modal image generation. Traditional cross-modal representation falls into two categories: joint representation and coordinated representation. Joint representation maps the information of multiple single modalities into one joint representation space, whereas coordinated representation processes each modality separately and learns the cross-modal representations together under similarity constraints. The self-supervised learning ability of deep neural networks (DNNs), especially Transformer-based methods, makes it possible to exploit large-scale unlabeled data. Extending the supervised learning paradigm, large pre-trained models are first trained on large-scale unlabeled data and then fine-tuned with a small amount of labeled data from downstream tasks. Compared with a model trained for a specific task, a pre-trained model has better versatility and transferability, and the fine-tuned model can be used to improve downstream tasks. Methods for cross-modal synthesis (e.g., image captioning and video captioning) are summarized, including end-to-end, semantic-based, and stylized methods. In addition, the current state of cross-modal conversion between image and text is analyzed, covering image captioning, video captioning, and visual question answering.

Cross-modal generation methods are also summarized, covering the joint representation of cross-modal information, image generation, text-image cross-modal generation, and cross-modal generation based on pre-trained models. In recent years, generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have advanced cross-modal generation tasks. Thanks to the strong adaptability and generation ability of DDPMs, cross-modal generation research has progressed, and limitations such as fragile texture generation have been alleviated to a certain extent. The development of GAN-based and DDPM-based methods is further summarized and analyzed.
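
The distinction between joint and coordinated representation can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the random "image" and "text" features, the fixed random projection matrices (which would be learned in practice), the function names, and the choice of a symmetric InfoNCE-style contrastive loss as one possible similarity constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features for 4 paired image/text samples.
img_feats = rng.normal(size=(4, 8))   # image features, dimension 8
txt_feats = rng.normal(size=(4, 6))   # text features, dimension 6

# Hypothetical projection matrices; a real model would learn these.
W_joint = rng.normal(size=(8 + 6, 5))
W_img = rng.normal(size=(8, 5))
W_txt = rng.normal(size=(6, 5))


def joint_representation(img, txt, w):
    """Joint representation: fuse both modalities into one vector per sample."""
    fused = np.concatenate([img, txt], axis=1)
    return np.tanh(fused @ w)


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def coordinated_similarity(img, txt, w_img, w_txt):
    """Coordinated representation: project each modality separately,
    then compare the two embeddings with cosine similarity."""
    z_img = l2_normalize(img @ w_img)
    z_txt = l2_normalize(txt @ w_txt)
    return z_img @ z_txt.T


def contrastive_loss(sim, temperature=0.1):
    """Symmetric InfoNCE-style loss; matched pairs sit on the diagonal."""
    logits = sim / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


joint = joint_representation(img_feats, txt_feats, W_joint)
sim = coordinated_similarity(img_feats, txt_feats, W_img, W_txt)
loss = contrastive_loss(sim)

print("joint representation shape:", joint.shape)
print("image-text similarity matrix shape:", sim.shape)
print("contrastive loss:", float(loss))
```

In the joint case a single fused vector represents each image-text pair, while in the coordinated case each modality keeps its own embedding and only the similarity between the two is constrained, which is why coordinated embeddings can be compared across modalities at retrieval time.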

Keywords

multimedia technology; cross-modal learning; foundation model; cross-modal representation; cross-modal generation; deep learning

references

Andrew G， Arora R， Bilmes J and Livescu K. 2013. Deep canonical correlation analysis//Proceedings of the 30th International Conference on Machine Learning. Atlanta， USA： JMLR.org： 1247-1255

Arjovsky M， Chintala S and Bottou L. 2017. Wasserstein generative adversarial networks//Proceedings of the 34th International Conference on Machine Learning. Sydney， Australia： JMLR.org： 214-223

Baltrušaitis T， Ahuja C and Morency L P. 2019. Multimodal machine learning： a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence， 41（2）： 423-443 ［DOI： 10.1109/TPAMI.2018.2798607http://dx.doi.org/10.1109/TPAMI.2018.2798607］

Brown T B， Mann B， Ryder N， Subbiah M， Kaplan J， Dhariwal P， Neelakantan A， Shyam P， Sastry G， Askell A， Agarwal S， Herbert-Voss A， Krueger G， Henighan T， Child R， Ramesh A， Ziegler D M， Wu J， Winter C， Hesse C， Chen M， Sigler E， Litwin M， Gray S， Chess B， Clark J， Berner C， McCandlish S， Radford A， Sutskever I and Amodei D. 2020. Language models are few-shot learners//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver， Canada： Curran Associates Inc.： 1877-1901

Cao Y， Long M S， Wang J M， Yang Q and Yu P S. 2016. Deep visual-semantic hashing for cross-modal retrieval//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco， USA： ACM： 1445-1454 ［DOI： 10.1145/2939672.2939812http://dx.doi.org/10.1145/2939672.2939812］

Chen S Z， Zhao Y D， Jin Q and Wu Q. 2020a. Fine-grained video-text retrieval with hierarchical graph reasoning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle， USA： IEEE： 10635-10644 ［DOI： 10.1109/CVPR42600.2020.01065http://dx.doi.org/10.1109/CVPR42600.2020.01065］

Chen Y C， Li L J， Yu L C， El Kholy A， Ahmed F， Gan Z， Cheng Y and Liu J J. 2020b. Uniter： universal image-text representation learning//Proceedings of the 16th European Conference on Computer Vision. Glasgow， UK： Springer： 104-120 ［DOI： 10.1007/978-3-030-58577-8_7http://dx.doi.org/10.1007/978-3-030-58577-8_7］

Cho J， Lei J， Tan H and Bansal M. 2021. Unifying vision-and-language tasks via text generation//Proceedings of the 38th International Conference on Machine Learning. Virtual Event： PMLR： 1931-1942

Cho K， Van Merriënboer B， Gulçehre Ç， Bahdanau D， Bougares F， Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha， Qatar： Association for Computational Linguistics： 1724-1734 ［DOI： 10.3115/v1/D14-1179http://dx.doi.org/10.3115/v1/D14-1179］

Das R and Singh T D. 2022. Assamese news image caption generation using attention mechanism. Multimedia Tools and Applications， 81（7）： 10051-10069 ［DOI： 10.1007/s11042-022-12042-8http://dx.doi.org/10.1007/s11042-022-12042-8］

Devlin J， Chang M W， Lee K and Toutanova K. 2019. BERT： pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies. Minnesota， USA： Association for Computational Linguistics： 4171-4186 ［DOI： 10.18653/v1/N19-1423http://dx.doi.org/10.18653/v1/N19-1423］

Dhariwal P and Nichol A. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems， 34： 8780-8794

Ding M， Yang Z Y， Hong W Y， Zheng W D， Zhou C， Yin D， Lin J Y， Zou X， Shao Z， Yang H X and Tang J. 2021. CogView： mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34： 19822-19835

Dong J F， Li X R， Xu C X， Ji S L， He Y， Yang G and Wang X. 2019. Dual encoding for zero-example video retrieval//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach， USA： IEEE： 9338-9347 ［DOI： 10.1109/CVPR.2019.00957http://dx.doi.org/10.1109/CVPR.2019.00957］

Dosovitskiy A， Beyer L， Kolesnikov A， Weissenborn D， Zhai X H， Unterthiner T， Dehghani M， Minderer M， Heigold G， Gelly S， Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words： transformers for image recognition at scale ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2010.11929.pdfhttps://arxiv.org/pdf/2010.11929.pdf

Frome A， Corrado G S， Shlens J， Bengio S， Dean J， Ranzato M and Mikolov T. 2013. DeViSE： a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Nevada， USA： Curran Associates Inc.： 2121-2129

Gafni O， Polyak A， Ashual O， Sheynin S， Parikh D and Taigman Y. 2022. Make-A-scene： scene-based text-to-image generation with human priors//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv， Israel： Springer： 89-106 ［DOI： 10.1007/978-3-031-19784-0_6http://dx.doi.org/10.1007/978-3-031-19784-0_6］

Gal R， Patashnik O， Maron H， Bermano A H， Chechik G and Cohen-Or D. 2022. StyleGAN-NADA： CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics， 41（4）： #141 ［DOI： 10.1145/3528223.3530164http://dx.doi.org/10.1145/3528223.3530164］

Gao J Y， Sun C， Yang Z H and Nevatia R. 2017. Tall： temporal activity localization via language query//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice， Italy： IEEE： 5277-5285 ［DOI： 10.1109/ICCV.2017.563http://dx.doi.org/10.1109/ICCV.2017.563］

Goodfellow I J， Pouget-Abadie J， Mirza M， Xu B， Warde-Farley D， Ozair S， Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal， Canada： MIT Press： 2672-2680

Gu J X， Meng X J， Lu G S， Hou L， Niu M Z， Liang X D， Yao L W， Huang R H， Zhang W， Jiang X， Xu C J and Xu H. 2022. Wukong： a 100 million large-scale Chinese cross-modal pre-training benchmark ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2202.06767.pdfhttps://arxiv.org/pdf/2202.06767.pdf

Gulrajani I， Ahmed F， Arjovsky M， Dumoulin V and Courville A C. 2017. Improved training of wasserstein GANs//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach， USA： Curran Associates Inc.： 5769-5779

Hendricks L A， Wang O， Shechtman E， Sivic J， Darrell T and Russell B. 2017. Localizing moments in video with natural language//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice， Italy： IEEE： 5804-5813 ［DOI： 10.1109/ICCV.2017.618http://dx.doi.org/10.1109/ICCV.2017.618］

Hinton G E， Osindero S and Teh Y W. 2006. A fast learning algorithm for deep belief nets. Neural Computation， 18（7）： 1527-1554 ［DOI： 10.1162/neco.2006.18.7.1527http://dx.doi.org/10.1162/neco.2006.18.7.1527］

Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation， 9（8）： 1735-1780 ［DOI： 10.1162/neco.1997.9.8.1735http://dx.doi.org/10.1162/neco.1997.9.8.1735］

Hosseinzadeh M and Wang Y. 2021. Image change captioning by learning from an auxiliary task//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville， USA： IEEE： 2724-2733 ［DOI： 10.1109/CVPR46437.2021.00275http://dx.doi.org/10.1109/CVPR46437.2021.00275］

Hu Q T， Wu W Y， Feng G， Pan T F and Qiu K X. 2021. A study on interpretable analysis of multimodal learning behavior supported by deep learning learning. E-education Research， 42（11）： 77-83

胡钦太，伍文燕，冯广，潘庭锋，邱凯星. 2021. 深度学习支持下多模态学习行为可解释性分析研究. 电化教育研究， 42（11）： 77-83 ［DOI： 10.13811/j.cnki.eer.2021.11.011http://dx.doi.org/10.13811/j.cnki.eer.2021.11.011］

Huang H Y， Liang Y B， Duan N， Gong M， Shou L J， Jiang D X and Zhou M. 2019. Unicoder： a universal language encoder by pre-training with multiple cross-lingual tasks//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong， China： Association for Computational Linguistics： 2485-2494 ［DOI： 10.18653/v1/D19-1252http://dx.doi.org/10.18653/v1/D19-1252］

Huang Q B， Liang Y， Wei J L， Cai Y， Liang H Y， Leung H F and Li Q. 2021. Image difference captioning with instance-level fine-grained feature representation. IEEE Transactions on Multimedia， 24： 2004-2017 ［DOI： 10.1109/TMM.2021.3074803http://dx.doi.org/10.1109/TMM.2021.3074803］

Jhamtani H and Berg-Kirkpatrick T. 2018. Learning to describe differences between pairs of similar images//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels， Belgium： Association for Computational Linguistics： 4024-4034 ［DOI： 10.18653/v1/D18-1436http://dx.doi.org/10.18653/v1/D18-1436］

Jia C， Yang Y F， Xia Y， Chen Y T， Parekh Z， Pham H， Le Q， Sung Y H， Li Z and Duerig T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision//Proceedings of the 38th International Conference on Machine Learning. ［s.l.］：［s.n.］： 4904-4916

Jiang Q Y and Li W J. 2017. Deep cross-modal hashing//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition （CVPR）. Honolulu， USA： IEEE： 3270-3278 ［DOI： 10.1109/CVPR.2017.348http://dx.doi.org/10.1109/CVPR.2017.348］

Johnson J， Karpathy A and Li F F. 2016. DenseCap： fully convolutional localization networks for dense captioning//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition （CVPR）. Las Vegas， USA： IEEE： 4565-4574 ［DOI： 10.1109/CVPR.2016.494http://dx.doi.org/10.1109/CVPR.2016.494］

Karras T， Aittala M， Laine S， Härkönen E， Hellsten J， Lehtinen J and Aila T. 2021. Alias-free generative adversarial networks ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2106.12423.pdfhttps://arxiv.org/pdf/2106.12423.pdf

Karras T， Laine S and Aila T. 2019. A style-based generator architecture for generative adversarial networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. Long Beach， USA： IEEE： 4396-4405 ［DOI： 10.1109/CVPR.2019.00453http://dx.doi.org/10.1109/CVPR.2019.00453］

Karras T， Laine S， Aittala M， Hellsten J， Lehtinen J and Aila T. 2020. Analyzing and improving the image quality of StyleGAN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. Seattle， USA： IEEE： 8107-8116 ［DOI： 10.1109/CVPR42600.2020.00813http://dx.doi.org/10.1109/CVPR42600.2020.00813］

Kavi P S， Pon K K， Kaliappan J， Selvaraj S K， Nagalakshmi R and Molla B. 2022. Caption generation based on emotions using CSPDenseNet and BiLSTM with self-attention. Applied Computational Intelligence and Soft Computing， 2022： #2756396 ［DOI： 10.1155/2022/2756396http://dx.doi.org/10.1155/2022/2756396］

Kim H， Kim J， Lee H， Park H and Kim G. 2021a. Viewpoint-agnostic change captioning with cycle consistency//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal， Canada： IEEE： 2075-2084 ［DOI： 10.1109/ICCV48922.2021.00210http://dx.doi.org/10.1109/ICCV48922.2021.00210］

Kim W， Son B and Kim I. 2021b. ViLT： vision-and-language transformer without convolution or region supervision//Proceedings of the 38th International Conference on Machine Learning. Virtual Event： PMLR： 5583-5594

Kim Y， Lee H and Provost E M. 2013. Deep learning for robust feature generation in audiovisual emotion recognition//Proceedings of 2013 IEEE International Conference on Acoustics， Speech and Signal Processing. Vancouver， Canada： IEEE： 3687-3691 ［DOI： 10.1109/ICASSP.2013.6638346http://dx.doi.org/10.1109/ICASSP.2013.6638346］

Kiros R， Salakhutdinov R and Zemel R S. 2014. Unifying visual-semantic embeddings with multimodal neural language models ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/1411.2539.pdfhttps://arxiv.org/pdf/1411.2539.pdf

Krishna R， Hata K， Ren F， Li F F and Niebles J C. 2017. Dense-captioning events in videos//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice， Italy： IEEE： 706-715 ［DOI： 10.1109/ICCV.2017.83http://dx.doi.org/10.1109/ICCV.2017.83］

Lai P L and Fyfe C. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems， 10（5）： 365-377 ［DOI： 10.1142/S012906570000034Xhttp://dx.doi.org/10.1142/S012906570000034X］

Lei J， Yu L C， Berg T L and Bansal M. 2020. TVR： a large-scale dataset for video-subtitle moment retrieval//Proceedings of the 16th European Conference on Computer Vision. Glasgow， UK： Springer： 447-463 ［DOI： 10.1007/978-3-030-58589-1_27http://dx.doi.org/10.1007/978-3-030-58589-1_27］

Li B W， Qi X J， Lukasiewicz T and Torr P H S. 2020a. ManiGAN： text-guided image manipulation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle， USA： IEEE： 7877-7886 ［DOI： 10.1109/CVPR42600.2020.00790http://dx.doi.org/10.1109/CVPR42600.2020.00790］

Li C X and Harrison B. 2021. 3M： multi-style image caption generation using multi-modality features under multi-UPDOWN model ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2103.11186.pdfhttps://arxiv.org/pdf/2103.11186.pdf

Li C X and Harrison B. 2022. StyleM： stylized metrics for image captioning built with contrastive N-grams ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2201.00975.pdfhttps://arxiv.org/pdf/2201.00975.pdf

Li G， Duan N， Fang Y J， Gong M and Jiang D X. 2020b. Unicoder-VL： a universal encoder for vision and language by cross-modal pre-training. Proceedings of 2020 AAAI Conference on Artificial Intelligence， 34（7）： 11336-11344 ［DOI： 10.1609/aaai.v34i07.6795http://dx.doi.org/10.1609/aaai.v34i07.6795］

Li L H， Yatskar M， Yin D， Hsieh C J and Chang K W. 2019. VisualBERT： a simple and performant baseline for vision and language ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/1908.03557.pdfhttps://arxiv.org/pdf/1908.03557.pdf

Li L J， Chen Y C， Cheng Y， Gan Z， Yu L C and Liu J J. 2020c. Hero： hierarchical encoder for video+language omni-representation pre-training//Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing. ［s.l.］： Association for Computational Linguistics： 2046-2065 ［DOI： 10.18653/v1/2020.emnlp-main.161http://dx.doi.org/10.18653/v1/2020.emnlp-main.161］

Liao L S. 2021. A Research on Image Description Based on Attention and Multi-level Vision Features. Shanghai： Shanghai University of Finance and Economics

廖雷双. 2021. 基于注意力机制与多层级视觉特征的图像描述方法研究. 上海：上海财经大学［DOI： 10.27296/d.cnki.gshcu.2021.001921http://dx.doi.org/10.27296/d.cnki.gshcu.2021.001921］

Liao Z M， Huang Q B， Liang Y， Fu M Y， Cai Y and Li Q. 2021. Scene graph with 3D information for change captioning//Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event， China： ACM： 5074-5082 ［DOI： 10.1145/3474085.3475712http://dx.doi.org/10.1145/3474085.3475712］

Lin J Y， Men R， Yang A， Zhou C， Zhang Y C， Wang P， Zhou J R， Tang J and Yang H X. 2021. M6： multi-modality-to-multi-modality multitask mega-transformer for unified pretraining//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Virtual Event， Singapore： ACM： 3251-3261 ［DOI： 10.1145/3447548.3467206http://dx.doi.org/10.1145/3447548.3467206］

Liu J， Zhu X X， Liu F， Guo L T， Zhao Z J， Sun M Z， Lu H Q， Wang W N， Lu H Q， Zhou S Y， Zhang J J and Wang J Q. 2021a. OPT： omni-perception pre-trainer for cross-modal understanding and generation ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2107.00249.pdfhttps://arxiv.org/pdf/2107.00249.pdf

Liu S， Ren Z and Yuan J S. 2021b. SibNet： sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence， 43（9）： 3259-3272 ［DOI： 10.1109/TPAMI.2019.2940007http://dx.doi.org/10.1109/TPAMI.2019.2940007］

Lu H Y， Fei N Y， Huo Y Q， Gao Y Z， Lu Z W and Wen J R. 2022. COTS： collaborative two-stream vision-language pre-training model for cross-modal retrieval//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans， USA： IEEE： 15671-15680 ［DOI： 10.1109/CVPR52688.2022.01524http://dx.doi.org/10.1109/CVPR52688.2022.01524］

Lu J S， Batra D， Parikh D and Lee S. 2019. ViLBERT： pretraining task-agnostic visiolinguistic representations for vision-and-language tasks//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver， Canada： Curran Associates Inc.： 13-23

Lu J S， Yang J W， Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona， Spain： Curran Associates Inc.： 289-297

Lugmayr A， Danelljan M， Romero A， Yu F， Timofte R and Van Gool L. 2022. RePaint： inpainting using denoising diffusion probabilistic models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans， USA： IEEE： 11451-11461 ［DOI： 10.1109/CVPR52688.2022.01117http://dx.doi.org/10.1109/CVPR52688.2022.01117］

Luo H S， Ji L， Shi B T， Huang H Y， Duan N， Li T R， Li J， Bharti T and Zhou M. 2020. UniVL： a unified video and language pre-training model for multimodal understanding and generation ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2002.06353.pdfhttps://arxiv.org/pdf/2002.06353.pdf

Mirza M and Osindero S. 2014. Conditional generative adversarial nets ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/1411.1784.pdfhttps://arxiv.org/pdf/1411.1784.pdf

Ngiam J， Khosla A， Kim M， Nam J， Lee H and Ng A Y. 2011. Multimodal deep learning//Proceedings of the 28th International Conference on Machine Learning. Bellevue， USA： Omnipress： 689-696

Nguyen K， Tripathi S， Du B， Guha T and Nguyen T Q. 2021. In defense of scene graphs for image captioning//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal， Canada： IEEE： 1387-1396 ［DOI： 10.1109/ICCV48922.2021.00144http://dx.doi.org/10.1109/ICCV48922.2021.00144］

Nichol A Q and Dhariwal P. 2021. Improved denoising diffusion probabilistic models//Proceedings of the 38th International Conference on Machine Learning. Virtual Event： PMLR： 8162-8171

Nichol A Q， Dhariwal P， Ramesh A， Shyam P， Mishkin P， McGrew B， Sutskever I and Chen M. 2022. GLIDE： towards photorealistic image generation and editing with text-guided diffusion models//Proceedings of the 39th International Conference on Machine Learning. Baltimore， USA： PMLR： 16784-16804

Nie L Q， Qu L G， Meng D， Zhang M， Tian Q and del Bimbo A. 2022. Search-oriented micro-video captioning//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa， Portugal： ACM： 3234-3243 ［DOI： 10.1145/3503161.3548180http://dx.doi.org/10.1145/3503161.3548180］

Park D H， Darrell T and Rohrbach A. 2019. Robust change captioning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul， Korea （South）： IEEE： 4623-4632 ［DOI： 10.1109/ICCV.2019.00472http://dx.doi.org/10.1109/ICCV.2019.00472］

Patashnik O， Wu Z Z， Shechtman E， Cohen-Or D and Lischinski D. 2021. StyleCLIP： text-driven manipulation of StyleGAN imagery//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal， Canada： IEEE： 2065-2074 ［DOI： 10.1109/ICCV48922.2021.00209http://dx.doi.org/10.1109/ICCV48922.2021.00209］

Qiao T T， Zhang J， Xu D Q and Tao D C. 2019. MirrorGAN： learning text-to-image generation by redescription//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach， USA： IEEE： 1505-1514 ［DOI： 10.1109/CVPR.2019.00160http://dx.doi.org/10.1109/CVPR.2019.00160］

Qiu Y， Yamamoto S， Nakashima K， Suzuki R， Iwata K， Kataoka H and Satoh Y. 2021. Describing and localizing multiple changes with transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal， Canada： IEEE： 1951-1960 ［DOI： 10.1109/ICCV48922.2021.00198http://dx.doi.org/10.1109/ICCV48922.2021.00198］

Radford A， Kim J W， Hallacy C， Ramesh A， Goh G， Agarwal S， Sastry G， Askell A， Mishkin P， Clark J， Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. Edinburgh， Scotland： PMLR： 8748-8763

Ramesh A， Dhariwal P， Nichol A， Chu C and Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2204.06125.pdfhttps://arxiv.org/pdf/2204.06125.pdf

Ramesh A， Pavlov M， Goh G， Gray S， Voss C， Radford A， Chen M and Sutskever I. 2021. Zero-shot text-to-image generation//Proceedings of the 38th International Conference on Machine Learning. Virtual Event： PMLR： 8821-8831

Rastegar S， Soleymani M S， Rabiee H R and Shojaee S M. 2016. MDL-CW： a multimodal deep learning framework with cross weights//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas， USA： IEEE： 2601-2609 ［DOI： 10.1109/CVPR.2016.285http://dx.doi.org/10.1109/CVPR.2016.285］

Reed S， Akata Z， Yan X C， Logeswaran L， Schiele B and Lee H. 2016. Generative adversarial text to image synthesis//Proceedings of the 33rd International Conference on Machine Learning. New York， USA： JMLR.org： 1060-1069

Ren S Q， He K M， Girshick R and Sun J. 2015. Faster R-CNN： towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal， Canada： MIT Press： 91-99

Saharia C， Chan W， Chang H W， Lee C， Ho J， Salimans T， Fleet D and Norouzi M. 2022a. Palette： image-to-image diffusion models//Proceedings of ACM SIGGRAPH 2022 Conference Proceedings. Vancouver， Canada： ACM： #15 ［DOI： 10.1145/3528233.3530757http://dx.doi.org/10.1145/3528233.3530757］

Saharia C， Ho J， Chan W， Salimans T， Fleet D J and Norouzi M. 2022b. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence： #320446 ［DOI： 10.1109/TPAMI.2022.3204461http://dx.doi.org/10.1109/TPAMI.2022.3204461］

Shen Z Q， Li J G， Su Z， Li M J， Chen Y R， Jiang Y G and Xue X Y. 2017. Weakly supervised dense video captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu， USA： IEEE： 5159-5167 ［DOI： 10.1109/CVPR.2017.548http://dx.doi.org/10.1109/CVPR.2017.548］

Shi X X， Yang X， Gu J X， Joty S and Cai J F. 2020. Finding it at another side： a viewpoint-adapted matching encoder for change captioning//Proceedings of the 16th European Conference on Computer Vision. Glasgow， UK： Springer： 574-590 ［DOI： 10.1007/978-3-030-58568-6_34http://dx.doi.org/10.1007/978-3-030-58568-6_34］

Silberer C and Lapata M. 2014. Learning grounded meaning representations with autoencoder//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore， Maryland， USA： Association for Computational Linguistics： 721-732 ［DOI： 10.3115/v1/P14-1068http://dx.doi.org/10.3115/v1/P14-1068］

Sohl-Dickstein J， Weiss E A， Maheswaranathan N and Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics//Proceedings of the 32nd International Conference on Machine Learning. Lille， France： JMLR.org： 2256-2265

Srivastava N and Salakhutdinov R. 2012. Learning representations for multimodal data with deep belief nets//Proceedings of 2012 International Conference on Machine Learning Workshop， Edinburgh， Scotland： PMLR： 1-8

Su W J， Zhu X Z， Cao Y， Li B， Lu L W， Wei F R and Dai J F. 2020. VL-BERT： pre-training of generic visual-linguistic representations//Proceedings of 2020 International Conference on Learning Representations. Addis Ababa， Ethiopia： ICLR： 1-6

Sun C， Myers A， Vondrick C， Murphy K and Schmid C. 2019. VideoBERT： a joint model for video and language representation learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul， Korea （South）： IEEE： 7463-7472 ［DOI： 10.1109/ICCV.2019.00756http://dx.doi.org/10.1109/ICCV.2019.00756］

Tan H and Bansal M. 2019. LXMert： learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong， China： Association for Computational Linguistics： 5100-5111 ［DOI： 10.18653/v1/D19-1514http://dx.doi.org/10.18653/v1/D19-1514］

Tao M， Tang H， Wu F， Jing X Y， Bao B K and Xu C S. 2022. DF-GAN： a simple and effective baseline for text-to-image synthesis//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans， USA： IEEE： 16494-16504 ［DOI： 10.1109/CVPR52688.2022.01602http://dx.doi.org/10.1109/CVPR52688.2022.01602］

Tian F， Sun X Q， Liu F， Li T Y， Zhang L and Liu Z G. 2021. Chinese image caption with dual attention and multi-label image. Computer Systems and Applications， 30（7）： 32-40

田枫，孙小强，刘芳，李婷玉，张蕾，刘志刚. 2021. 融合双注意力与多标签的图像中文描述生成方法. 计算机系统应用， 30（7）： 32-40 ［DOI： 10.15888/j.cnki.csa.008010http://dx.doi.org/10.15888/j.cnki.csa.008010］

Tu Y B， Li L， Su L， Gao S X， Yan C G， Zha Z J， Yu Z T and Huang Q M. 2022. I2Transformer： intra- and inter-relation embedding transformer for TV show captioning. IEEE Transactions on Image Processing， 31： 3565-3577 ［DOI： 10.1109/TIP.2022.3159472http://dx.doi.org/10.1109/TIP.2022.3159472］

Tu Y B， Yao T T， Li L， Lou J D， Gao S X， Yu Z T and Yan C G. 2021. Semantic relation-aware difference representation learning for change captioning//Findings of the Association for Computational Linguistics： ACL-IJCNLP 2021. Virtual Event： Association for Computational Linguistics： 63-73 ［DOI： 10.18653/v1/2021.findings-acl.6http://dx.doi.org/10.18653/v1/2021.findings-acl.6］

van den Oord A， Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach， USA： Curran Associates Inc.： 6309-6318

Vaswani A， Shazeer N， Parmar N， Uszkoreit J， Jones L， Gomez A N， Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach， USA： Curran Associates Inc.： 6000-6010

Venugopalan S， Rohrbach M， Donahue J， Mooney R， Darrell T and Saenko K. 2015. Sequence to sequence-video to text//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago， Chile： IEEE： 4534-4542 ［DOI： 10.1109/ICCV.2015.515http://dx.doi.org/10.1109/ICCV.2015.515］

Wang C， Yang H J， Bartz C and Meinel C. 2016. Image captioning with deep bidirectional LSTMs//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam， the Netherlands： ACM： 988-997 ［DOI： 10.1145/2964284.2964299http://dx.doi.org/10.1145/2964284.2964299］

Wang J F， Yang Z Y， Hu X W， Li L J， Lin K， Gan Z， Liu Z C， Liu C and Wang L J. 2022. GIT： a generative image-to-text transformer for vision and language ［EB/OL］. ［2023-01-05］. https://arxiv.org/pdf/2205.14100.pdfhttps://arxiv.org/pdf/2205.14100.pdf

Wang L X， Shang C， Qiu H Q， Zhao T J， Qiu B L and Li H L. 2020. Multi-stage tag guidance network in video caption//Proceedings of the 28th ACM International Conference on Multimedia. Seattle， USA： ACM： 4610-4614 ［DOI： 10.1145/3394171.3416288http://dx.doi.org/10.1145/3394171.3416288］

Wang W， Gao J Y， Yang X S and Xu C S. 2021. Learning coarse-to-fine graph neural networks for video-text retrieval. IEEE Transactions on Multimedia， 23： 2386-2397 ［DOI： 10.1109/tmm.2020.3011288http://dx.doi.org/10.1109/tmm.2020.3011288］

Wang X， Chen W H， Wu J W， Wang Y F and Wang W Y. 2018b. Video captioning via hierarchical reinforcement learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City， USA： IEEE： 4213-4222 ［DOI： 10.1109/CVPR.2018.00443http://dx.doi.org/10.1109/CVPR.2018.00443］

Wei X S， Song Y Z， Mac Aodha O， Wu J X， Peng Y X， Tang J H， Yang J and Belongie S. 2022. Fine-grained image analysis with deep learning： a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence， 44（12）： 8927-8948 ［DOI： 10.1109/TPAMI.2021.3126648］

Weston J， Bengio S and Usunier N. 2010. Large scale image annotation： learning to rank with joint word-image embeddings. Machine Learning， 81（1）： 21-35 ［DOI： 10.1007/s10994-010-5198-3］

Wu C F， Liu J L， Wang X J and Dong X. 2018. Object-difference attention： a simple relational attention for visual question answering//Proceedings of the 26th ACM International Conference on Multimedia. Seoul， Korea （South）： ACM： 519-527 ［DOI： 10.1145/3240508.3240513］

Wu Z Z， Lischinski D and Shechtman E. 2021. StyleSpace analysis： disentangled controls for StyleGAN image generation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville， USA： IEEE： 12858-12867 ［DOI： 10.1109/CVPR46437.2021.01267］

Xia W H， Yang Y J， Xue J H and Wu B Y. 2021. TediGAN： text-guided diverse face image generation and manipulation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville， USA： IEEE： 2256-2265 ［DOI： 10.1109/CVPR46437.2021.00229］

Xie C W， Wu J M， Zheng Y， Pan P and Hua X S. 2022. Token embeddings alignment for cross-modal retrieval//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa， Portugal： ACM： 4555-4563 ［DOI： 10.1145/3503161.3548107］

Xiong Y L， Dai B and Lin D H. 2018. Move forward and tell： a progressive generator of video descriptions//Proceedings of the 15th European Conference on Computer Vision. Munich， Germany： Springer： 489-505 ［DOI： 10.1007/978-3-030-01252-6_29］

Xu H Y， Yan M， Li C L， Bi B， Huang S F， Xiao W M and Huang F. 2021. E2E-VLP： end-to-end vision-language pre-training enhanced by visual learning//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. Online： Association for Computational Linguistics： 503-513 ［DOI： 10.18653/v1/2021.acl-long.42］

Xu T， Zhang P C， Huang Q Y， Zhang H， Gan Z， Huang X L and He X D. 2018. AttnGAN： fine-grained text to image generation with attentional generative adversarial networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City， USA： IEEE： 1316-1324 ［DOI： 10.1109/CVPR.2018.00143］

Xue H W， Hang T K， Zeng Y H， Sun Y C， Liu B， Yang H， Fu J L and Guo B N. 2022. Advancing high-resolution video-language representation with large-scale video transcriptions//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans， USA： IEEE： 5026-5035 ［DOI： 10.1109/CVPR52688.2022.00498］

Yao L L， Wang W Y and Jin Q. 2022. Image difference captioning with pre-training and contrastive learning. Proceedings of 2022 AAAI Conference on Artificial Intelligence， 36（3）： 3108-3116 ［DOI： 10.1609/aaai.v36i3.20218］

You Q Z， Luo J B and Zhang Z Y. 2018. End-to-end convolutional semantic embeddings//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City， USA： IEEE： 5735-5744 ［DOI： 10.1109/CVPR.2018.00601］

Yuan L， Chen D D， Chen Y L， Codella N， Dai X Y， Gao J F， Hu H D， Huang X D， Li B X， Li C Y， Liu C， Liu M C， Liu Z C， Lu Y M， Shi Y， Wang L J， Wang J F， Xiao B， Xiao Z， Yang J W， Zeng M， Zhou L W and Zhang P C. 2021. Florence： a new foundation model for computer vision ［EB/OL］. ［2021-11-22］. https://arxiv.org/pdf/2111.11432.pdf

Zhang H， Xu T， Li H S， Zhang S T， Wang X G， Huang X L and Metaxas D. 2017. StackGAN： text to photo-realistic image synthesis with stacked generative adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice， Italy： IEEE： 5908-5916 ［DOI： 10.1109/ICCV.2017.629］

Zhang H， Xu T， Li H S， Zhang S T， Wang X G， Huang X L and Metaxas D N. 2019. StackGAN++： realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence， 41（8）： 1947-1962 ［DOI： 10.1109/TPAMI.2018.2856256］

Zhang K W. 2021. Research on Chinese-Oriented Image Caption Generation Method. Harbin： Harbin Institute of Technology （张楷文. 2021. 面向中文的图像描述生成方法研究. 哈尔滨：哈尔滨工业大学） ［DOI： 10.27061/d.cnki.ghgdu.2021.003103］

Zhang Z J， Wu Q， Wang Y and Chen F. 2021. Exploring region relationships implicitly： image captioning with visual relationship attention. Image and Vision Computing， 109： #104146 ［DOI： 10.1016/J.IMAVIS.2021.104146］

Zhang Z Q， Shi Y Y， Yuan C F， Li B， Wang P J， Hu W M and Zha Z J. 2020. Object relational graph with teacher-recommended learning for video captioning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle， USA： IEEE： 13275-13285 ［DOI： 10.1109/cvpr42600.2020.01329］

Zhou L W， Palangi H， Zhang L， Hu H D， Corso J and Gao J F. 2020. Unified vision-language pre-training for image captioning and VQA//Proceedings of 2020 AAAI Conference on Artificial Intelligence， 34（7）： 13041-13049 ［DOI： 10.1609/AAAI.V34I07.7005］

Zhou L W， Zhou Y B， Corso J J， Socher R and Xiong C M. 2018. End-to-end dense video captioning with masked transformer//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City， USA： IEEE： 8739-8748 ［DOI： 10.1109/CVPR.2018.00911］

Zhu L C and Yang Y. 2020. ActBERT： learning global-local video-text representations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle， USA： IEEE： 8743-8752 ［DOI： 10.1109/cvpr42600.2020.00877］

Zhu M F， Pan P B， Chen W and Yang Y. 2019. DM-GAN： dynamic memory generative adversarial networks for text-to-image synthesis//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach， USA： IEEE： 5795-5803 ［DOI： 10.1109/CVPR.2019.00595］
