A Review of Research on Video-Text Retrieval Based on Joint Embedding Space
2025, pages 1-18
Received: 2024-12-08; Revised: 2025-03-07; Accepted: 2025-03-24; Published online: 2025-03-26
DOI: 10.11834/jig.240747
Video plays an important role in people's daily lives, and in the face of explosively growing video data, video-text retrieval offers users a convenient way to find the information they are interested in. Video-text retrieval aims to take a text or video query supplied by the user and retrieve, from a video or text corpus, the videos or texts most relevant to that query. This paper systematically reviews work on video-text retrieval based on joint embedding spaces to aid understanding of the field's development. First, existing work is categorized and analyzed along the four steps of joint-embedding-based video-text retrieval: video feature representation extraction, text feature representation extraction, video-text feature alignment, and the objective function, and the strengths and weaknesses of each category of method are discussed. Next, from an experimental perspective, the benchmark datasets and evaluation metrics for video-text retrieval are presented, and the performance of representative models is compared on several widely used datasets. Finally, the challenges and future directions of video-text retrieval are discussed.
With the advent of the big-data era, video platforms such as YouTube, TikTok, and Kuaishou have gained popularity thanks to their rich video content. However, the explosive growth of data makes it difficult for users to retrieve the content that interests them. Traditional unimodal retrieval relies on manual annotation, which limits flexibility and incurs high costs. Video-text retrieval (VTR) addresses this problem by using deep learning to enable cross-modal retrieval between video and text: given either a text or a video query, it retrieves the most relevant items from the corresponding database. Early VTR methods relied on predefined concepts, which lacked scalability. VTR based on joint embedding spaces has since become mainstream; it bridges the modality gap through feature extraction and alignment, holds significant value for both natural language processing and computer vision, and is widely applied in sectors such as healthcare, social media, and short video. Joint-embedding-based VTR involves four key components: video feature representation extraction, text feature representation extraction, video-text feature alignment, and the objective function. Video feature representation extraction converts videos into feature vectors that machines can process, and divides into two lines of work: spatiotemporal features and multimodal features. Spatiotemporal features are obtained by extracting spatial information from video frames and modeling temporal information across them. Multimodal features integrate audio, subtitles, and motion information within the video to enrich video understanding. Methods based on multimodal features aggregate rich multimodal information and effectively improve retrieval performance; however, they place high demands on datasets, requiring large amounts of labeled data to extract features from each modality.
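As a minimal illustration of the spatiotemporal branch described above, the sketch below mean-pools hypothetical per-frame features into a single video embedding. The frame encoder, feature dimension, and choice of pooling are illustrative assumptions rather than the method of any specific paper; stronger models replace the pooling step with temporal modeling.

```python
import numpy as np

def mean_pool_video(frame_features: np.ndarray) -> np.ndarray:
    """Aggregate per-frame features of shape (T, D) into one (D,) video embedding.

    Mean pooling is the simplest temporal aggregation; stronger methods replace
    this step with RNNs or temporal Transformers to capture frame order.
    """
    video_emb = frame_features.mean(axis=0)
    # L2-normalize so that cosine similarity reduces to a dot product.
    return video_emb / np.linalg.norm(video_emb)

# Example: 8 sampled frames, each with a 512-dim feature from some frame encoder.
frames = np.random.rand(8, 512).astype(np.float32)
video_embedding = mean_pool_video(frames)
```

The normalization matters: once both video and text embeddings are unit-length, the similarity computation in the joint space is a single matrix product.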
Furthermore, such methods lack intelligent multimodal fusion mechanisms, cannot coordinate the relationships between modalities, and still leave room for improvement in retrieval efficiency. Text feature representation extraction maps high-dimensional, discrete language sentences into low-dimensional dense feature representations; the key is effective modeling of the sequential relationships within the text. Early methods represented word embeddings with bag-of-words and word2vec, and then modeled dependencies between words with RNNs and CNNs. More recently, the Transformer, through its self-attention mechanism, processes text in parallel and captures global dependencies, achieving breakthroughs on multiple benchmarks; it is now the most competitive approach. Video-text feature alignment maps the feature representations of video and text into a shared embedding space for similarity computation. Coarse-grained alignment computes a global similarity, which is efficient but cannot capture subtle semantic differences. Fine-grained alignment instead aligns local information, capturing low-level features at lower layers and semantic information at higher layers; it can also sharpen the model's perception of detail through explicit alignment, thereby improving retrieval accuracy. Objective functions include the triplet loss and the contrastive loss. The triplet loss optimizes the model by requiring positive pairs to be more similar than negative pairs, but it is strongly affected by the quality of negative samples and by batch size. The contrastive loss pulls positive pairs closer together and pushes negative pairs farther apart; it requires no margin threshold and is now commonly used to optimize VTR models, overcoming some limitations of the triplet loss.
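The contrastive objective discussed above can be made concrete. The sketch below implements a symmetric InfoNCE-style contrastive loss over a batch of video and text embeddings, assuming the embeddings are already L2-normalized; the temperature value and batch layout are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for a batch of matched pairs.

    Row i of video_emb and row i of text_emb form a positive pair; every other
    entry in the same row/column of the similarity matrix is a negative.
    Embeddings are assumed L2-normalized, so dot product = cosine similarity.
    """
    sim = video_emb @ text_emb.T / temperature  # (B, B) logits

    def cross_entropy(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))  # ground truth lies on the diagonal

    # Average the video-to-text and text-to-video retrieval directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(sim.T))
```

Note that, unlike the triplet loss, no margin hyperparameter is needed: each in-batch negative contributes through the softmax denominator.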
VTR models typically adopt a pretrain-then-finetune strategy: they are pretrained on large-scale image-text and video datasets and then fine-tuned on video-text retrieval benchmarks. The benchmark datasets are summarized in terms of their scale and video duration. Models are evaluated with R@1, R@5, R@10, MdR (median rank), and MnR (mean rank). Comparing the results of representative models on typical datasets yields several conclusions. First, in multimodal video feature representation extraction, although existing methods extract and aggregate multimodal information with expert models, piling on modality information has not significantly improved performance and may even introduce noise, highlighting the urgent need for intelligent modality-fusion methods. Second, spatiotemporal feature representation extraction is crucial to performance, particularly given the advantages of the Transformer architecture in modeling spatial and temporal information; future research will focus on how to relate temporal and spatial information effectively to strengthen video representations. In addition, fine-grained information interaction can effectively improve performance, but the complexity of such model structures makes optimization and deployment difficult, so more efficient fine-grained interaction methods need to be explored. Lastly, model rankings vary across datasets, reflecting the influence of dataset differences and model structure and indicating the need to develop VTR models with stronger generalization. Several challenges and future directions for video-text retrieval are also discussed. The first challenge is the lack of high-quality datasets, which limits model training.
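The ranking metrics named above are straightforward to compute from a query-by-candidate similarity matrix. In the sketch below, the convention that the correct match for query i is candidate i is an assumption made for illustration (it matches how paired benchmark test sets are usually laid out).

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@1/5/10, MdR, and MnR from an (N queries x N candidates)
    similarity matrix where the ground-truth match for query i is item i."""
    order = np.argsort(-sim, axis=1)               # best candidate first
    gt = np.arange(sim.shape[0])
    ranks = np.where(order == gt[:, None])[1] + 1  # 1-based rank of the match
    return {
        "R@1":  float((ranks <= 1).mean() * 100),  # % of queries solved at rank 1
        "R@5":  float((ranks <= 5).mean() * 100),
        "R@10": float((ranks <= 10).mean() * 100),
        "MdR":  float(np.median(ranks)),           # median rank, lower is better
        "MnR":  float(ranks.mean()),               # mean rank, lower is better
    }
```

Higher R@K and lower MdR/MnR indicate better retrieval; MnR is the more sensitive of the two rank statistics because a few badly ranked queries can dominate the mean.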
Existing datasets offer only limited means of evaluating temporal modeling, and standardized, high-quality datasets are urgently needed. Second, retrieval efficiency is often overlooked by existing methods; for large-scale video retrieval, improving efficiency without sacrificing accuracy will be a major focus of future research. Third, scalable retrieval models remain a challenge: current models require fine-tuning for each dataset, so future work will focus on leveraging the general knowledge of foundation models to improve adaptability and transferability. Lastly, the exploration of unsupervised learning is becoming a trend, with future research focusing on how to optimize models using large amounts of unlabeled video data.