Cross-modal representation learning and generation
Vol. 28, No. 6, 2023, pp. 1608-1629
Print publication date: 2023-06-16
DOI: 10.11834/jig.230035
刘华峰, 陈静静, 李亮, 鲍秉坤, 李泽超, 刘家瑛, 聂礼强. 2023. 跨模态表征与生成技术. 中国图象图形学报, 28(06):1608-1629
Liu Huafeng, Chen Jingjing, Li Liang, Bao Bingkun, Li Zechao, Liu Jiaying, Nie Liqiang. 2023. Cross-modal representation learning and generation. Journal of Image and Graphics, 28(06):1608-1629
With the rapid growth of multimedia data, its multi-source and multi-modal nature has become a challenging problem in multimedia research. Representation and generation are two key aspects of cross-modal learning. Cross-modal representation studies feature learning and information integration over multi-modal data; to obtain more effective feature representations, the complementarity between modalities needs to be strengthened. Cross-modal generation focuses on the knowledge transfer mechanism across modalities: semantic consistency between modalities can be used to convert data from one modality to another, which helps improve the ability to transfer between modalities. This paper critically reviews the literature on cross-modal representation and generation from four aspects: 1) traditional cross-modal representation learning, 2) large-model-based cross-modal representation learning, 3) image-to-text cross-modal conversion, and 4) cross-modal image generation.

Traditional cross-modal representation falls into two categories: joint representation and coordinated representation. Joint representation maps the information of multiple single modalities into one joint representation space, whereas coordinated representation processes each modality separately and learns the cross-modal representations under similarity constraints. Deep neural network (DNN) based methods, especially Transformer-based ones, show strong self-supervised learning ability on large-scale unlabeled data. Extending the supervised learning paradigm, large pre-trained models are first trained on large-scale unlabeled data and then fine-tuned with a small amount of labeled data from downstream tasks. Compared with models trained for specific tasks, pre-trained models offer better versatility and transferability, and the fine-tuned models can further improve performance on downstream tasks.

The development of cross-modal synthesis methods (i.e., image and video captioning) is summarized, including end-to-end, semantic-based, and style-based methods. In addition, the current state of cross-modal conversion between image and text is analyzed, including image captioning, video captioning, and visual question answering. Cross-modal generation methods are also summarized with respect to the joint representation of cross-modal information, image generation, text-image cross-modal generation, and cross-modal generation based on pre-trained models. In recent years, generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have driven progress in cross-modal generation tasks. Thanks to the strong adaptability and generation ability of DDPMs, cross-modal generation research has advanced, and the limitation of fragile textures has been alleviated to some extent. The development of GAN-based and DDPM-based methods is further summarized and analyzed.
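The distinction between joint and coordinated representation drawn in the abstract can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy feature sizes, the random projection matrices, and the function names `joint_representation`, `coordinated_representation`, and `similarity_loss`. In a real system the projections would be learned and the encoders would be deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 4 paired samples with 8-dim image and 6-dim text features.
n, d_img, d_txt, d = 4, 8, 6, 5
img = rng.normal(size=(n, d_img))
txt = rng.normal(size=(n, d_txt))


def joint_representation(img, txt, w):
    """Joint: fuse both modalities into ONE vector per sample."""
    fused = np.concatenate([img, txt], axis=1)  # shape (n, d_img + d_txt)
    return np.tanh(fused @ w)                   # shape (n, d)


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def coordinated_representation(img, txt, w_img, w_txt):
    """Coordinated: project each modality separately into its own space."""
    return l2_normalize(img @ w_img), l2_normalize(txt @ w_txt)


def similarity_loss(z_img, z_txt):
    """Similarity constraint: mean (1 - cosine similarity) over matched pairs."""
    return float(np.mean(1.0 - np.sum(z_img * z_txt, axis=1)))


# Random (untrained) projections, used only to show the shapes involved.
joint = joint_representation(img, txt, rng.normal(size=(d_img + d_txt, d)))
z_img, z_txt = coordinated_representation(
    img, txt, rng.normal(size=(d_img, d)), rng.normal(size=(d_txt, d))
)
loss = similarity_loss(z_img, z_txt)
print(joint.shape, z_img.shape, z_txt.shape, round(loss, 3))
```

In this sketch the joint variant needs both modalities at inference time, while the coordinated variant yields one embedding per modality and ties them together only through the loss, which is the property that makes it convenient for retrieval-style tasks.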
Keywords: multimedia technology; cross-modal learning; foundation model; cross-modal representation; cross-modal generation; deep learning
Andrew G, Arora R, Bilmes J and Livescu K. 2013. Deep canonical correlation analysis//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: JMLR.org: 1247-1255
Arjovsky M, Chintala S and Bottou L. 2017. Wasserstein generative adversarial networks//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org: 214-223
Baltrušaitis T, Ahuja C and Morency L P. 2019. Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2): 423-443 [DOI: 10.1109/TPAMI.2018.2798607http://dx.doi.org/10.1109/TPAMI.2018.2798607]
Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I and Amodei D. 2020. Language models are few-shot learners//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 1877-1901
Cao Y, Long M S, Wang J M, Yang Q and Yu P S. 2016. Deep visual-semantic hashing for cross-modal retrieval//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, USA: ACM: 1445-1454 [DOI: 10.1145/2939672.2939812http://dx.doi.org/10.1145/2939672.2939812]
Chen S Z, Zhao Y D, Jin Q and Wu Q. 2020a. Fine-grained video-text retrieval with hierarchical graph reasoning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10635-10644 [DOI: 10.1109/CVPR42600.2020.01065http://dx.doi.org/10.1109/CVPR42600.2020.01065]
Chen Y C, Li L J, Yu L C, El Kholy A, Ahmed F, Gan Z, Cheng Y and Liu J J. 2020b. Uniter: universal image-text representation learning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 104-120 [DOI: 10.1007/978-3-030-58577-8_7http://dx.doi.org/10.1007/978-3-030-58577-8_7]
Cho J, Lei J, Tan H and Bansal M. 2021. Unifying vision-and-language tasks via text generation//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 1931-1942
Cho K, Van Merriënboer B, Gulçehre Ç, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/D14-1179http://dx.doi.org/10.3115/v1/D14-1179]
Das R and Singh T D. 2022. Assamese news image caption generation using attention mechanism. Multimedia Tools and Applications, 81(7): 10051-10069 [DOI: 10.1007/s11042-022-12042-8http://dx.doi.org/10.1007/s11042-022-12042-8]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minnesota, USA: Association for Computational Linguistics: 4171-4186 [DOI: 10.18653/v1/N19-1423http://dx.doi.org/10.18653/v1/N19-1423]
Dhariwal P and Nichol A. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34: 8780-8794
Ding M, Yang Z Y, Hong W Y, Zheng W D, Zhou C, Yin D, Lin J Y, Zou X, Shao Z, Yang H X and Tang J. 2021. CogView: mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems 34: 19822-19835
Dong J F, Li X R, Xu C X, Ji S L, He Y, Yang G and Wang X. 2019. Dual encoding for zero-example video retrieval//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9338-9347 [DOI: 10.1109/CVPR.2019.00957http://dx.doi.org/10.1109/CVPR.2019.00957]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: transformers for image recognition at scale [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2010.11929.pdfhttps://arxiv.org/pdf/2010.11929.pdf
Frome A, Corrado G S, Shlens J, Bengio S, Dean J, Ranzato M and Mikolov T. 2013. DeViSE: a deep visual-semantic embedding model//Proceedings of the 26th International Conference on Neural Information Processing Systems. Nevada, USA: Curran Associates Inc.: 2121-2129
Gafni O, Polyak A, Ashual O, Sheynin S, Parikh D and Taigman Y. 2022. Make-A-scene: scene-based text-to-image generation with human priors//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 89-106 [DOI: 10.1007/978-3-031-19784-0_6http://dx.doi.org/10.1007/978-3-031-19784-0_6]
Gal R, Patashnik O, Maron H, Bermano A H, Chechik G and Cohen-Or D. 2022. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics, 41(4): #141 [DOI: 10.1145/3528223.3530164http://dx.doi.org/10.1145/3528223.3530164]
Gao J Y, Sun C, Yang Z H and Nevatia R. 2017. Tall: temporal activity localization via language query//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5277-5285 [DOI: 10.1109/ICCV.2017.563http://dx.doi.org/10.1109/ICCV.2017.563]
Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 2672-2680
Gu J X, Meng X J, Lu G S, Hou L, Niu M Z, Liang X D, Yao L W, Huang R H, Zhang W, Jiang X, Xu C J and Xu H. 2022. Wukong: a 100 million large-scale Chinese cross-modal pre-training benchmark [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2202.06767.pdfhttps://arxiv.org/pdf/2202.06767.pdf
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V and Courville A C. 2017. Improved training of wasserstein GANs//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5769-5779
Hendricks L A, Wang O, Shechtman E, Sivic J, Darrell T and Russell B. 2017. Localizing moments in video with natural language//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5804-5813 [DOI: 10.1109/ICCV.2017.618http://dx.doi.org/10.1109/ICCV.2017.618]
Hinton G E, Osindero S and Teh Y W. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527-1554 [DOI: 10.1162/neco.2006.18.7.1527http://dx.doi.org/10.1162/neco.2006.18.7.1527]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735http://dx.doi.org/10.1162/neco.1997.9.8.1735]
Hosseinzadeh M and Wang Y. 2021. Image change captioning by learning from an auxiliary task//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 2724-2733 [DOI: 10.1109/CVPR46437.2021.00275http://dx.doi.org/10.1109/CVPR46437.2021.00275]
Hu Q T, Wu W Y, Feng G, Pan T F and Qiu K X. 2021. A study on interpretable analysis of multimodal learning behavior supported by deep learning learning. E-education Research, 42(11): 77-83
胡钦太, 伍文燕, 冯广, 潘庭锋, 邱凯星. 2021. 深度学习支持下多模态学习行为可解释性分析研究. 电化教育研究, 42(11): 77-83 [DOI: 10.13811/j.cnki.eer.2021.11.011http://dx.doi.org/10.13811/j.cnki.eer.2021.11.011]
Huang H Y, Liang Y B, Duan N, Gong M, Shou L J, Jiang D X and Zhou M. 2019. Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 2485-2494 [DOI: 10.18653/v1/D19-1252http://dx.doi.org/10.18653/v1/D19-1252]
Huang Q B, Liang Y, Wei J L, Cai Y, Liang H Y, Leung H F and Li Q. 2021. Image difference captioning with instance-level fine-grained feature representation. IEEE Transactions on Multimedia, 24: 2004-2017 [DOI: 10.1109/TMM.2021.3074803http://dx.doi.org/10.1109/TMM.2021.3074803]
Jhamtani H and Berg-Kirkpatrick T. 2018. Learning to describe differences between pairs of similar images//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics: 4024-4034 [DOI: 10.18653/v1/D18-1436http://dx.doi.org/10.18653/v1/D18-1436]
Jia C, Yang Y F, Xia Y, Chen Y T, Parekh Z, Pham H, Le Q, Sung Y H, Li Z and Duerig T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: [s.n.]: 4904-4916
Jiang Q Y and Li W J. 2017. Deep cross-modal hashing//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 3270-3278 [DOI: 10.1109/CVPR.2017.348http://dx.doi.org/10.1109/CVPR.2017.348]
Johnson J, Karpathy A and Li F F. 2016. DenseCap: fully convolutional localization networks for dense captioning//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 4565-4574 [DOI: 10.1109/CVPR.2016.494http://dx.doi.org/10.1109/CVPR.2016.494]
Karras T, Aittala M, Laine S, Härkönen E, Hellsten J, Lehtinen J and Aila T. 2021. Alias-free generative adversarial networks [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2106.12423.pdfhttps://arxiv.org/pdf/2106.12423.pdf
Karras T, Laine S and Aila T. 2019. A style-based generator architecture for generative adversarial networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 4396-4405 [DOI: 10.1109/CVPR.2019.00453http://dx.doi.org/10.1109/CVPR.2019.00453]
Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J and Aila T. 2020. Analyzing and improving the image quality of StyleGAN//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 8107-8116 [DOI: 10.1109/CVPR42600.2020.00813http://dx.doi.org/10.1109/CVPR42600.2020.00813]
Kavi P S, Pon K K, Kaliappan J, Selvaraj S K, Nagalakshmi R and Molla B. 2022. Caption generation based on emotions using CSPDenseNet and BiLSTM with self-attention. Applied Computational Intelligence and Soft Computing, 2022: #2756396 [DOI: 10.1155/2022/2756396http://dx.doi.org/10.1155/2022/2756396]
Kim H, Kim J, Lee H, Park H and Kim G. 2021a. Viewpoint-agnostic change captioning with cycle consistency//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2075-2084 [DOI: 10.1109/ICCV48922.2021.00210http://dx.doi.org/10.1109/ICCV48922.2021.00210]
Kim W, Son B and Kim I. 2021b. ViLT: vision-and-language transformer without convolution or region supervision//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 5583-5594
Kim Y, Lee H and Provost E M. 2013. Deep learning for robust feature generation in audiovisual emotion recognition//Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE: 3687-3691 [DOI: 10.1109/ICASSP.2013.6638346http://dx.doi.org/10.1109/ICASSP.2013.6638346]
Kiros R, Salakhutdinov R and Zemel R S. 2014. Unifying visual-semantic embeddings with multimodal neural language models [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/1411.2539.pdfhttps://arxiv.org/pdf/1411.2539.pdf
Krishna R, Hata K, Ren F, Li F F and Niebles J C. 2017. Dense-captioning events in videos//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 706-715 [DOI: 10.1109/ICCV.2017.83http://dx.doi.org/10.1109/ICCV.2017.83]
Lai P L and Fyfe C. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5): 365-377 [DOI: 10.1142/S012906570000034Xhttp://dx.doi.org/10.1142/S012906570000034X]
Lei J, Yu L C, Berg T L and Bansal M. 2020. TVR: a large-scale dataset for video-subtitle moment retrieval//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 447-463 [DOI: 10.1007/978-3-030-58589-1_27http://dx.doi.org/10.1007/978-3-030-58589-1_27]
Li B W, Qi X J, Lukasiewicz T and Torr P H S. 2020a. ManiGAN: text-guided image manipulation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7877-7886 [DOI: 10.1109/CVPR42600.2020.00790http://dx.doi.org/10.1109/CVPR42600.2020.00790]
Li C X and Harrison B. 2021. 3M: multi-style image caption generation using multi-modality features under multi-UPDOWN model [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2103.11186.pdfhttps://arxiv.org/pdf/2103.11186.pdf
Li C X and Harrison B. 2022. StyleM: stylized metrics for image captioning built with contrastive N-grams [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2201.00975.pdfhttps://arxiv.org/pdf/2201.00975.pdf
Li G, Duan N, Fang Y J, Gong M and Jiang D X. 2020b. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. Proceedings of 2020 AAAI Conference on Artificial Intelligence, 34(7): 11336-11344 [DOI: 10.1609/aaai.v34i07.6795http://dx.doi.org/10.1609/aaai.v34i07.6795]
Li L H, Yatskar M, Yin D, Hsieh C J and Chang K W. 2019. VisualBERT: a simple and performant baseline for vision and language [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/1908.03557.pdfhttps://arxiv.org/pdf/1908.03557.pdf
Li L J, Chen Y C, Cheng Y, Gan Z, Yu L C and Liu J J. 2020c. Hero: hierarchical encoder for video+language omni-representation pre-training//Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing. [s.l.]: Association for Computational Linguistics: 2046-2065 [DOI: 10.18653/v1/2020.emnlp-main.161http://dx.doi.org/10.18653/v1/2020.emnlp-main.161]
Liao L S. 2021. A Research on Image Description Based on Attention and Multi-level Vision Features. Shanghai: Shanghai University of Finance and Economics
廖雷双. 2021. 基于注意力机制与多层级视觉特征的图像描述方法研究. 上海: 上海财经大学 [DOI: 10.27296/d.cnki.gshcu.2021.001921http://dx.doi.org/10.27296/d.cnki.gshcu.2021.001921]
Liao Z M, Huang Q B, Liang Y, Fu M Y, Cai Y and Li Q. 2021. Scene graph with 3D information for change captioning//Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event, China: ACM: 5074-5082 [DOI: 10.1145/3474085.3475712http://dx.doi.org/10.1145/3474085.3475712]
Lin J Y, Men R, Yang A, Zhou C, Zhang Y C, Wang P, Zhou J R, Tang J and Yang H X. 2021. M6: multi-modality-to-multi-modality multitask mega-transformer for unified pretraining//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Virtual Event, Singapore: ACM: 3251-3261 [DOI: 10.1145/3447548.3467206http://dx.doi.org/10.1145/3447548.3467206]
Liu J, Zhu X X, Liu F, Guo L T, Zhao Z J, Sun M Z, Lu H Q, Wang W N, Lu H Q, Zhou S Y, Zhang J J and Wang J Q. 2021a. OPT: omni-perception pre-trainer for cross-modal understanding and generation [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2107.00249.pdfhttps://arxiv.org/pdf/2107.00249.pdf
Liu S, Ren Z and Yuan J S. 2021b. SibNet: sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9): 3259-3272 [DOI: 10.1109/TPAMI.2019.2940007http://dx.doi.org/10.1109/TPAMI.2019.2940007]
Lu H Y, Fei N Y, Huo Y Q, Gao Y Z, Lu Z W and Wen J R. 2022. COTS: collaborative two-stream vision-language pre-training model for cross-modal retrieval//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 15671-15680 [DOI: 10.1109/CVPR52688.2022.01524http://dx.doi.org/10.1109/CVPR52688.2022.01524]
Lu J S, Batra D, Parikh D and Lee S. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 13-23
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 289-297
Lugmayr A, Danelljan M, Romero A, Yu F, Timofte R and Van Gool L. 2022. RePaint: inpainting using denoising diffusion probabilistic models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 11451-11461 [DOI: 10.1109/CVPR52688.2022.01117http://dx.doi.org/10.1109/CVPR52688.2022.01117]
Luo H S, Ji L, Shi B T, Huang H Y, Duan N, Li T R, Li J, Bharti T and Zhou M. 2020. UniVL: a unified video and language pre-training model for multimodal understanding and generation [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2002.06353.pdfhttps://arxiv.org/pdf/2002.06353.pdf
Mirza M and Osindero S. 2014. Conditional generative adversarial nets [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/1411.1784.pdfhttps://arxiv.org/pdf/1411.1784.pdf
Ngiam J, Khosla A, Kim M, Nam J, Lee H and Ng A Y. 2011. Multimodal deep learning//Proceedings of the 28th International Conference on Machine Learning. Bellevue, USA: Omnipress: 689-696
Nguyen K, Tripathi S, Du B, Guha T and Nguyen T Q. 2021. In defense of scene graphs for image captioning//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1387-1396 [DOI: 10.1109/ICCV48922.2021.00144http://dx.doi.org/10.1109/ICCV48922.2021.00144]
Nichol A Q and Dhariwal P. 2021. Improved denoising diffusion probabilistic models//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 8162-8171
Nichol A Q, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, Sutskever I and Chen M. 2022. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models//Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR: 16784-16804
Nie L Q, Qu L G, Meng D, Zhang M, Tian Q and del Bimbo A. 2022. Search-oriented micro-video captioning//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa, Portugal: ACM: 3234-3243 [DOI: 10.1145/3503161.3548180http://dx.doi.org/10.1145/3503161.3548180]
Park D H, Darrell T and Rohrbach A. 2019. Robust change captioning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4623-4632 [DOI: 10.1109/ICCV.2019.00472http://dx.doi.org/10.1109/ICCV.2019.00472]
Patashnik O, Wu Z Z, Shechtman E, Cohen-Or D and Lischinski D. 2021. StyleCLIP: text-driven manipulation of StyleGAN imagery//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 2065-2074 [DOI: 10.1109/ICCV48922.2021.00209http://dx.doi.org/10.1109/ICCV48922.2021.00209]
Qiao T T, Zhang J, Xu D Q and Tao D C. 2019. MirrorGAN: learning text-to-image generation by redescription//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1505-1514 [DOI: 10.1109/CVPR.2019.00160http://dx.doi.org/10.1109/CVPR.2019.00160]
Qiu Y, Yamamoto S, Nakashima K, Suzuki R, Iwata K, Kataoka H and Satoh Y. 2021. Describing and localizing multiple changes with transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 1951-1960 [DOI: 10.1109/ICCV48922.2021.00198http://dx.doi.org/10.1109/ICCV48922.2021.00198]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. Edinburgh, Scotland: PMLR: 8748-8763
Ramesh A, Dhariwal P, Nichol A, Chu C and Chen M. 2022. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2204.06125.pdfhttps://arxiv.org/pdf/2204.06125.pdf
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M and Sutskever I. 2021. Zero-shot text-to-image generation//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 8821-8831
Rastegar S, Soleymani M S, Rabiee H R and Shojaee S M. 2016. MDL-CW: a multimodal deep learning framework with cross weights//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2601-2609 [DOI: 10.1109/CVPR.2016.285http://dx.doi.org/10.1109/CVPR.2016.285]
Reed S, Akata Z, Yan X C, Logeswaran L, Schiele B and Lee H. 2016. Generative adversarial text to image synthesis//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR.org: 1060-1069
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 91-99
Saharia C, Chan W, Chang H W, Lee C, Ho J, Salimans T, Fleet D and Norouzi M. 2022a. Palette: image-to-image diffusion models//Proceedings of ACM SIGGRAPH 2022 Conference Proceedings. Vancouver, Canada: ACM: #15 [DOI: 10.1145/3528233.3530757http://dx.doi.org/10.1145/3528233.3530757]
Saharia C, Ho J, Chan W, Salimans T, Fleet D J and Norouzi M. 2022b. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence: #320446 [DOI: 10.1109/TPAMI.2022.3204461http://dx.doi.org/10.1109/TPAMI.2022.3204461]
Shen Z Q, Li J G, Su Z, Li M J, Chen Y R, Jiang Y G and Xue X Y. 2017. Weakly supervised dense video captioning//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5159-5167 [DOI: 10.1109/CVPR.2017.548http://dx.doi.org/10.1109/CVPR.2017.548]
Shi X X, Yang X, Gu J X, Joty S and Cai J F. 2020. Finding it at another side: a viewpoint-adapted matching encoder for change captioning//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 574-590 [DOI: 10.1007/978-3-030-58568-6_34http://dx.doi.org/10.1007/978-3-030-58568-6_34]
Silberer C and Lapata M. 2014. Learning grounded meaning representations with autoencoder//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland, USA: Association for Computational Linguistics: 721-732 [DOI: 10.3115/v1/P14-1068http://dx.doi.org/10.3115/v1/P14-1068]
Sohl-Dickstein J, Weiss E A, Maheswaranathan N and Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org: 2256-2265
Srivastava N and Salakhutdinov R. 2012. Learning representations for multimodal data with deep belief nets//Proceedings of 2012 International Conference on Machine Learning Workshop, Edinburgh, Scotland: PMLR: 1-8
Su W J, Zhu X Z, Cao Y, Li B, Lu L W, Wei F R and Dai J F. 2020. VL-BERT: pre-training of generic visual-linguistic representations//Proceedings of 2020 International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR: 1-6
Sun C, Myers A, Vondrick C, Murphy K and Schmid C. 2019. VideoBERT: a joint model for video and language representation learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7463-7472 [DOI: 10.1109/ICCV.2019.00756http://dx.doi.org/10.1109/ICCV.2019.00756]
Tan H and Bansal M. 2019. LXMert: learning cross-modality encoder representations from transformers//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 5100-5111 [DOI: 10.18653/v1/D19-1514http://dx.doi.org/10.18653/v1/D19-1514]
Tao M, Tang H, Wu F, Jing X Y, Bao B K and Xu C S. 2022. DF-GAN: a simple and effective baseline for text-to-image synthesis//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 16494-16504 [DOI: 10.1109/CVPR52688.2022.01602http://dx.doi.org/10.1109/CVPR52688.2022.01602]
Tian F, Sun X Q, Liu F, Li T Y, Zhang L and Liu Z G. 2021. Chinese image caption with dual attention and multi-label image. Computer Systems and Applications, 30(7): 32-40
田枫, 孙小强, 刘芳, 李婷玉, 张蕾, 刘志刚. 2021. 融合双注意力与多标签的图像中文描述生成方法. 计算机系统应用, 30(7): 32-40 [DOI: 10.15888/j.cnki.csa.008010http://dx.doi.org/10.15888/j.cnki.csa.008010]
Tu Y B, Li L, Su L, Gao S X, Yan C G, Zha Z J, Yu Z T and Huang Q M. 2022. I2Transformer: intra- and inter-relation embedding transformer for TV show captioning. IEEE Transactions on Image Processing, 31: 3565-3577 [DOI: 10.1109/TIP.2022.3159472http://dx.doi.org/10.1109/TIP.2022.3159472]
Tu Y B, Yao T T, Li L, Lou J D, Gao S X, Yu Z T and Yan C G. 2021. Semantic relation-aware difference representation learning for change captioning//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Virtual Event: Association for Computational Linguistics: 63-73 [DOI: 10.18653/v1/2021.findings-acl.6http://dx.doi.org/10.18653/v1/2021.findings-acl.6]
van den Oord A, Vinyals O and Kavukcuoglu K. 2017. Neural discrete representation learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6309-6318
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T and Saenko K. 2015. Sequence to sequence-video to text//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4534-4542 [DOI: 10.1109/ICCV.2015.515http://dx.doi.org/10.1109/ICCV.2015.515]
Wang C, Yang H J, Bartz C and Meinel C. 2016. Image captioning with deep bidirectional LSTMs//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, the Netherlands: ACM: 988-997 [DOI: 10.1145/2964284.2964299http://dx.doi.org/10.1145/2964284.2964299]
Wang J F, Yang Z Y, Hu X W, Li L J, Lin K, Gan Z, Liu Z C, Liu C and Wang L J. 2022. GIT: a generative image-to-text transformer for vision and language [EB/OL]. [2023-01-05]. https://arxiv.org/pdf/2205.14100.pdfhttps://arxiv.org/pdf/2205.14100.pdf
Wang L X, Shang C, Qiu H Q, Zhao T J, Qiu B L and Li H L. 2020. Multi-stage tag guidance network in video caption//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 4610-4614 [DOI: 10.1145/3394171.3416288http://dx.doi.org/10.1145/3394171.3416288]
Wang W, Gao J Y, Yang X S and Xu C S. 2021. Learning coarse-to-fine graph neural networks for video-text retrieval. IEEE Transactions on Multimedia, 23: 2386-2397 [DOI: 10.1109/tmm.2020.3011288http://dx.doi.org/10.1109/tmm.2020.3011288]
Wang X, Chen W H, Wu J W, Wang Y F and Wang W Y. 2018b. Video captioning via hierarchical reinforcement learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4213-4222 [DOI: 10.1109/CVPR.2018.00443http://dx.doi.org/10.1109/CVPR.2018.00443]
Wei X S, Song Y Z, Aodha O M, Wu J X, Peng Y X, Tang J H, Yang J and Belongie S. 2022. Fine-grained image analysis with deep learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12): 8927-8948 [DOI: 10.1109/TPAMI.2021.3126648http://dx.doi.org/10.1109/TPAMI.2021.3126648]
Weston J, Bengio S and Usunier N. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1): 21-35 [DOI: 10.1007/s10994-010-5198-3http://dx.doi.org/10.1007/s10994-010-5198-3]
Wu C F, Liu J L, Wang X J and Dong X. 2018. Object-difference attention: a simple relational attention for visual question answering//Proceedings of the 26th ACM International Conference on Multimedia. Seoul Korea (South): ACM: 519-527 [DOI: 10.1145/3240508.3240513http://dx.doi.org/10.1145/3240508.3240513]
Wu Z Z, Lischinski D and Shechtman E. 2021. StyleSpace analysis: disentangled controls for StyleGAN image generation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12858-12867 [DOI: 10.1109/CVPR46437.2021.01267http://dx.doi.org/10.1109/CVPR46437.2021.01267]
Xia W H, Yang Y J, Xue J H and Wu B Y. 2021. TediGAN: text-guided diverse face image generation and manipulation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 2256-2265 [DOI: 10.1109/CVPR46437.2021.00229http://dx.doi.org/10.1109/CVPR46437.2021.00229]
Xie C W, Wu J M, Zheng Y, Pan P and Hua X S. 2022. Token embeddings alignment for cross-modal retrieval//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa, Portugal: ACM: 4555-4563 [DOI: 10.1145/3503161.3548107http://dx.doi.org/10.1145/3503161.3548107]
Xiong Y L, Dai B and Lin D H. 2018. Move forward and tell: a progressive generator of video descriptions//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 489-505 [DOI: 10.1007/978-3-030-01252-6_29http://dx.doi.org/10.1007/978-3-030-01252-6_29]
Xu H Y, Yan M, Li C L, Bi B, Huang S F, Xiao W M and Huang F. 2021. E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics: 503-513 [DOI: 10.18653/v1/2021.acl-long.42http://dx.doi.org/10.18653/v1/2021.acl-long.42]
Xu T, Zhang P C, Huang Q Y, Zhang H, Gan Z, Huang X L and He X D. 2018. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1316-1324 [DOI: 10.1109/CVPR.2018.00143http://dx.doi.org/10.1109/CVPR.2018.00143]
Xue H W, Hang T K, Zeng Y H, Sun Y C, Liu B, Yang H, Fu J L and Guo B N. 2022. Advancing high-resolution video-language representation with large-scale video transcriptions//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 5026-5035 [DOI: 10.1109/CVPR52688.2022.00498http://dx.doi.org/10.1109/CVPR52688.2022.00498]
Yao L L, Wang W Y and Jin Q. 2022. Image difference captioning with pre-training and contrastive learning. Proceedings of 2022 AAAI Conference on Artificial Intelligence, 36(3): 3108-3116 [DOI: 10.1609/aaai.v36i3.20218http://dx.doi.org/10.1609/aaai.v36i3.20218]
You Q Z, Luo J B and Zhang Z Y. 2018. End-to-end convolutional semantic embeddings//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5735-5744 [DOI: 10.1109/CVPR.2018.00601http://dx.doi.org/10.1109/CVPR.2018.00601]
Yuan L, Chen D D, Chen Y L, Codella N, Dai X Y, Gao J F, Hu H D, Huang X D, Li B X, Li C Y, Liu C, Liu M C, Liu Z C, Lu Y M, Shi Y, Wang L J, Wang J F, Xiao B, Xiao Z, Yang J W, Zeng M, Zhou L W and Zhang P C. 2021. Florence: a new foundation model for computer vision [EB/OL]. [2021-11-22]. https://arxiv.org/pdf/2111.11432.pdfhttps://arxiv.org/pdf/2111.11432.pdf
Zhang H, Xu T, Li H S, Zhang S T, Wang X G, Huang X L and Metaxas D. 2017. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5908-5916 [DOI: 10.1109/ICCV.2017.629http://dx.doi.org/10.1109/ICCV.2017.629]
Zhang H, Xu T, Li H S, Zhang S T, Wang X G, Huang X L and Metaxas D N. 2019. StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1947-1962 [DOI: 10.1109/TPAMI.2018.2856256http://dx.doi.org/10.1109/TPAMI.2018.2856256]
Zhang K W. 2021. Research on Chinese-Oriented Image Caption Generation Method. Harbin: Harbin Institute of Technology
张楷文. 2021. 面向中文的图像描述生成方法研究. 哈尔滨: 哈尔滨工业大学 [DOI: 10.27061/d.cnki.ghgdu.2021.003103http://dx.doi.org/10.27061/d.cnki.ghgdu.2021.003103]
Zhang Z J, Wu Q, Wang Y and Chen F. 2021. Exploring region relationships implicitly: image captioning with visual relationship attention. Image and Vision Computing, 109: #104146 [DOI: 10.1016/J.IMAVIS.2021.104146http://dx.doi.org/10.1016/J.IMAVIS.2021.104146]
Zhang Z Q, Shi Y Y, Yuan C F, Li B, Wang P J, Hu W M and Zha Z J. 2020. Object relational graph with teacher-recommended learning for video captioning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13275-13285 [DOI: 10.1109/cvpr42600.2020.01329http://dx.doi.org/10.1109/cvpr42600.2020.01329]
Zhou L W, Palangi H, Zhang L, Hu H D, Corso J and Gao J F. 2020. Unified vision-language pre-training for image captioning and VQA//Proceedings of 2020 AAAI Conference on Artificial Intelligence, 34(7): 13041-13049 [DOI: 10.1609/AAAI.V34I07.7005http://dx.doi.org/10.1609/AAAI.V34I07.7005]
Zhou L W, Zhou Y B, Corso J J, Socher R and Xiong C M. 2018. End-to-end dense video captioning with masked transformer//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8739-8748 [DOI: 10.1109/CVPR.2018.00911http://dx.doi.org/10.1109/CVPR.2018.00911]
Zhu L C and Yang Y. 2020. ActBERT: learning global-local video-text representations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 8743-8752 [DOI: 10.1109/cvpr42600.2020.00877http://dx.doi.org/10.1109/cvpr42600.2020.00877]
Zhu M F, Pan P B, Chen W and Yang Y. 2019. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5795-5803 [DOI: 10.1109/CVPR.2019.00595http://dx.doi.org/10.1109/CVPR.2019.00595]