Complementary multi-type prompts for weakly-supervised temporal action localization
2025, Vol. 30, No. 3, pp. 842-854
Received: 2024-06-26; Revised: 2024-09-04; Published in print: 2025-03-16
DOI: 10.11834/jig.240354
Objective
Weakly-supervised temporal action localization locates the start and end times of action instances and identifies their categories using only video-level annotations. Current vision-language-based methods exploit textual prompt information to improve temporal action localization models. In vision-language models, action label text is usually encapsulated as textual prompts, which can be divided by type into handcrafted prompts and learnable prompts. Existing methods, however, ignore the complementarity between the two, so the introduced textual prompts cannot fully play their guiding role. To address this, a multi-type prompts complementary model for weakly-supervised temporal action localization is proposed.
Method
First, a prompt interaction module is designed in which the different types of textual prompts interact with the video separately and are weighted by attention, yielding feature information at different scales. Second, to model the correspondence between text and video, a segment-level contrastive loss is used to constrain the matching between the textual prompts and the action segments. Finally, a threshold filtering module is designed to filter and compare the scores in multiple class activation sequences (CAS), enhancing the discriminability of action categories.
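As a rough illustration of the segment-level contrastive constraint, an InfoNCE-style objective (in the spirit of van den Oord et al., 2019) between a class prompt embedding and its matched segments could take the following form; the exact formulation used in the paper may differ.

$$
\mathcal{L}_{\mathrm{con}} = -\frac{1}{|\mathcal{P}_c|}\sum_{i\in\mathcal{P}_c}\log\frac{\exp\!\left(\mathrm{sim}(\boldsymbol{t}_c,\boldsymbol{f}_i)/\tau\right)}{\sum_{j=1}^{T}\exp\!\left(\mathrm{sim}(\boldsymbol{t}_c,\boldsymbol{f}_j)/\tau\right)}
$$

where $\boldsymbol{t}_c$ is the prompt embedding of class $c$, $\boldsymbol{f}_j$ are the $T$ segment features, $\mathcal{P}_c$ is the set of segments matched to class $c$, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is a temperature.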
Result
The proposed method is compared with related methods on three representative datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3. It achieves 39.1% mean average precision (mAP)(0.1:0.7) on THUMOS14 and 27.3% mAP(0.5:0.95) on ActivityNet1.2, improvements of 1.1% and 1% over the P-MIL (proposal-based multiple instance learning) method, respectively. On ActivityNet1.3, its mAP(0.5:0.95) is on par with competing work, with an average mAP of 26.7%.
Conclusion
The proposed temporal action localization model uses the complementarity of the two types of textual prompts to guide localization. The proposed threshold filtering module makes the most of the strengths of both prompt types and maximizes their auxiliary role, making the localization results more accurate.
Objective
Weakly-supervised temporal action localization uses only video-level annotations to locate the start and end times of action instances and to identify their categories. Because only video-level annotations are available in the weakly-supervised setting, a loss function cannot be designed directly for the localization task. Existing work therefore generally adopts a "localization by classification" strategy and trains with multiple instance learning. This process has two limitations. 1) Localization and classification are different tasks with a notable gap between them, so localizing from classification results may limit the final performance. 2) In the weakly-supervised setting, fine-grained supervision for effectively distinguishing actions from background in videos is lacking, which poses a considerable challenge for localization. Vision-language models have recently received extensive attention. These models aim to model the correspondence between images and texts for more comprehensive visual perception, and well-designed textual prompts can improve their performance and robustness when such large models are applied to downstream tasks. Vision-language-based approaches currently utilize auxiliary textual prompts to compensate for the missing supervision and to improve the performance and robustness of temporal action localization models. In these models, action label text is usually encapsulated as textual prompts, which can be categorized into handcrafted prompts and learnable prompts. Handcrafted prompts consist of a fixed template and an action label (e.g., "a video of {class}"); they capture generalized knowledge of the action class but lack knowledge specific to the relevant action. In contrast, learnable prompts consist of a set of learnable vectors that are adjusted and optimized during training and can therefore learn more specific knowledge. The two prompt types complement each other and together improve the ability to discriminate between categories. However, existing methods ignore this complementarity, preventing the introduced textual prompts from fully exerting their guiding role and introducing noise, which leads to inaccurate action boundaries. Therefore, this paper proposes a multi-type prompts complementary model for weakly-supervised temporal action localization, which maximizes the guiding effect of textual prompts to improve the accuracy of action localization.
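To make the two prompt types concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of how handcrafted and learnable prompts are typically built for a CLIP-style text encoder; the template, context length, dimensions, and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PromptBuilder(nn.Module):
    """Sketch of handcrafted vs. learnable prompts for action labels (names are hypothetical)."""

    def __init__(self, classnames, n_ctx=16, ctx_dim=512):
        super().__init__()
        self.classnames = classnames
        # Learnable prompts: a shared set of context vectors optimized during training (CoOp-style).
        self.learnable_ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))

    def handcrafted_texts(self):
        # Handcrafted prompts: a fixed template wrapped around each action label,
        # e.g. "a video of high jump"; the strings are then fed to the text encoder's tokenizer.
        return [f"a video of {name}" for name in self.classnames]

    def learnable_embeddings(self, class_token_embeds):
        # Prepend the shared learnable context to each class-name token embedding.
        # class_token_embeds: (num_classes, n_cls_tokens, ctx_dim) from a frozen token embedding layer.
        ctx = self.learnable_ctx.unsqueeze(0).expand(len(self.classnames), -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)


# Usage sketch:
builder = PromptBuilder(["high jump", "long jump", "diving"])
print(builder.handcrafted_texts())           # fixed-template prompts
fake_cls_embeds = torch.randn(3, 4, 512)     # stand-in for tokenized class-name embeddings
print(builder.learnable_embeddings(fake_cls_embeds).shape)  # torch.Size([3, 20, 512])
```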
Method
First, starting from the complementarity of textual prompts, this paper designs a prompt interaction module that matches handcrafted and learnable prompts with action segments to obtain separate similarity matrices. The intrinsic connection between the textual and segment features is then mined through an attention layer and fused, realizing the interaction between the different types of textual prompts. In addition, the textual prompts must be effectively matched with action segments to play their guiding role, so this paper introduces a segment-level contrastive loss that constrains the matching between the prompts and the action segments. Finally, this paper designs a threshold filtering module that filters, according to a threshold, the class activation sequences (CAS) obtained under the guidance of the different types of textual prompts. The final CAS is obtained by summing, with a specific scale parameter, the CAS produced by this multi-layer filtering and the CAS obtained from video-level features only, so that it covers the parts of each sequence with higher confidence scores, thereby realizing the complementary advantages of the different types of textual prompt information.
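As a rough sketch of the threshold filtering idea described above (the authors' exact filtering and fusion rule may differ), the prompt-guided class activation sequences can be thresholded and then combined with the video-only CAS using a scale parameter; `tau` and `alpha` below are assumed hyperparameters.

```python
import torch


def fuse_cas(cas_video, cas_hand, cas_learn, tau=0.5, alpha=0.5):
    """Threshold-filter prompt-guided CAS and fuse them with the video-only CAS.

    cas_*: class activation sequences of shape (T, C) -- T snippets, C classes.
    tau is a confidence threshold and alpha a scale parameter (both illustrative).
    """
    # Keep the more confident of the two prompt-guided scores per snippet and class.
    prompt_cas = torch.maximum(cas_hand, cas_learn)
    # Suppress low-confidence entries so noisy prompt guidance does not leak into the fusion.
    prompt_cas = torch.where(prompt_cas > tau, prompt_cas, torch.zeros_like(prompt_cas))
    # Weighted sum with the CAS obtained from video-level features only.
    return alpha * prompt_cas + (1.0 - alpha) * cas_video


# Usage sketch:
T, C = 100, 20
fused = fuse_cas(torch.rand(T, C), torch.rand(T, C), torch.rand(T, C))
print(fused.shape)  # torch.Size([100, 20])
```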
Result
Extensive experiments on three representative datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, validate the effectiveness of the proposed method. On THUMOS14, the method achieves average mAP(0.1:0.5), mAP(0.1:0.7), and mAP(0.3:0.7) of 58.2%, 39.1%, and 47.9%, respectively, exceeding the average performance of P-MIL (proposal-based multiple instance learning) by up to 1.1%. On ActivityNet1.2, the method achieves 27.3% mAP(0.5:0.95), an average improvement of nearly 1% over the state-of-the-art method. On ActivityNet1.3, the mAP(0.5:0.95) is comparable to that of related work, with an average mAP of 26.7%. In addition, extensive ablation experiments on THUMOS14 confirm the effectiveness of each module.
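For reference, mAP(0.1:0.7) denotes the mean of the mAP values computed at temporal IoU thresholds from 0.1 to 0.7 in steps of 0.1, the usual reporting convention on THUMOS14:

$$
\mathrm{mAP}(0.1{:}0.7)=\frac{1}{7}\sum_{k=1}^{7}\mathrm{mAP}\big(\mathrm{tIoU}=0.1\,k\big),
$$

and mAP(0.5:0.95) on ActivityNet likewise averages over thresholds from 0.5 to 0.95 in steps of 0.05.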
Conclusion
This paper proposes a new weakly-supervised temporal action localization model based on the complementarity of multiple types of prompts, which alleviates the problem of inaccurate localization boundaries by exploiting the complementarity between different types of textual prompts. A prompt interaction module is designed to realize the interaction between the different types of textual prompt information. In addition, a segment-level contrastive loss is introduced to model the correspondence between the textual prompts and the video, effectively constraining the matching between the introduced prompts and the action segments. Finally, a multi-layer nested threshold filtering module is designed, which maximizes the advantages of the two types of textual prompts, avoids interference from noisy information, and maximizes the auxiliary role of the textual information. Extensive experiments on three challenging datasets demonstrate the effectiveness of the proposed method.
Carreira J and Zisserman A. 2017. Quo Vadis, action recognition? A new model and the Kinetics dataset // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Chen M Y, Gao J Y, Yang S C and Xu C S. 2022. Dual-evidential learning for weakly-supervised temporal action localization // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 192-208 [DOI: 10.1007/978-3-031-19772-7_12]
Cheng F and Bertasius G. 2022. TALLFormer: temporal action localization with a long-memory transformer [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2204.01680.pdf
Duval V, Aujol J F and Gousseau Y. 2009. The TVL1 model: a geometric point of view. Multiscale Modeling and Simulation, 8(1): 154-189 [DOI: 10.1137/090757083]
Gaidon A, Harchaoui Z and Schmid C. 2013. Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11): 2782-2795 [DOI: 10.1109/TPAMI.2013.65]
Gao J Y, Chen M Y and Xu C S. 2022. Fine-grained temporal contrastive learning for weakly-supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2203.16800.pdf
He B, Yang X T, Kang L, Cheng Z Y, Zhou X and Shrivastava A. 2022. ASM-Loc: action-aware segment modeling for weakly-supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2203.15187.pdf
He D L, Zhao X, Huang J Z, Li F, Liu X and Wen S L. 2019. Read, watch, and move: reinforcement learning for temporally grounding natural language descriptions in videos // Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 8393-8400 [DOI: 10.1609/aaai.v33i01.33018393]
Heilbron F C, Escorcia V, Ghanem B and Niebles J C. 2015. ActivityNet: a large-scale video benchmark for human activity understanding // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 961-970 [DOI: 10.1109/CVPR.2015.7298698]
Hong F T, Feng J C, Xu D, Shan Y and Zheng W S. 2021. Cross-modal consensus network for weakly supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2107.12589.pdf
Huang L J, Wang L and Li H S. 2022. Weakly supervised temporal action localization via representative snippet knowledge propagation [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2203.02925.pdf
Idrees H, Zamir A R, Jiang Y G, Gorban A, Laptev I, Sukthankar R and Shah M. 2017. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155: 1-23 [DOI: 10.1016/j.cviu.2016.10.018]
Islam A, Long C J and Radke R. 2021. A hybrid attention mechanism for weakly-supervised temporal action localization // Proceedings of the 35th AAAI Conference on Artificial Intelligence. Virtually: AAAI: 1637-1645 [DOI: 10.1609/aaai.v35i2.16256]
Ju C, Zheng K H, Liu J X, Zhao P S, Zhang Y, Chang J L, Wang Y F and Tian Q. 2022. Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2212.09335.pdf
Kingma D P and Ba J L. 2017. Adam: a method for stochastic optimization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/1412.6980.pdf
Lee P, Uh Y and Byun H. 2020. Background suppression network for weakly-supervised temporal action localization // Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11320-11327 [DOI: 10.1609/aaai.v34i07.6793]
Li G Z, Cheng D, Ding X P, Wang N N, Wang X Y and Gao X B. 2023. Boosting weakly-supervised temporal action localization with text information [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2305.00607.pdf
Li J J, Yang T Y, Ji W, Wang J and Cheng L. 2022a. Exploring denoised cross-video contrast for weakly-supervised temporal action localization // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 19882-19892 [DOI: 10.1109/CVPR52688.2022.01929]
Li Z Q, Ge Y X, Yu J R and Chen Z M. 2022b. Forcing the whole video as background: an adversarial learning strategy for weakly temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2207.06659.pdf
Liu D C, Jiang T T and Wang Y Z. 2019. Completeness modeling and context separation for weakly supervised temporal action localization // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1298-1307 [DOI: 10.1109/CVPR.2019.00139]
Liu Z Y, Wang L, Zhang Q L, Tang W, Yuan J S, Zheng N N and Hua G. 2021. ACSNet: action-context separation network for weakly supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2103.15088.pdf
Ma J W, Gorti S K, Volkovs M and Yu G W. 2021. Weakly supervised action selection learning in video [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2105.02439.pdf
Narayan S, Cholakkal H, Hayat M, Khan F S, Yang M H and Shao L. 2021. D2-Net: weakly-supervised action localization via discriminative embeddings and denoised activations [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2012.06440.pdf
Paul S, Roy S and Roy-Chowdhury A K. 2018. W-TALC: weakly-supervised temporal activity localization and classification [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/1807.10418.pdf
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2103.00020.pdf
Ren H, Yang W F, Zhang T Z and Zhang Y D. 2023. Proposal-based multiple instance learning for weakly-supervised temporal action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2305.17861.pdf
Shi H C, Zhang X Y, Li C S, Gong L X, Li Y and Bao Y J. 2022. Dynamic graph modeling for weakly-supervised temporal action localization // Proceedings of the 30th ACM International Conference on Multimedia. Lisboa, Portugal: ACM: 3820-3828 [DOI: 10.1145/3503161.3548077]
Shou Z, Chan J, Zareian A, Miyazawa K and Chang S F. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1417-1426 [DOI: 10.1109/CVPR.2017.155]
Singh K K and Lee Y J. 2017. Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/1704.04232.pdf
Tapaswi M, Zhu Y K, Stiefelhagen R, Torralba A, Urtasun R and Fidler S. 2016. MovieQA: understanding stories in movies through question-answering // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4631-4640 [DOI: 10.1109/CVPR.2016.501]
van den Oord A, Li Y Z and Vinyals O. 2019. Representation learning with contrastive predictive coding [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/1807.03748.pdf
Wang L M, Xiong Y J, Lin D H and Van Gool L. 2017. UntrimmedNets for weakly supervised action recognition and detection // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6402-6411 [DOI: 10.1109/CVPR.2017.678]
Wang Y, Li Y D and Wang H B. 2023. Two-stream networks for weakly-supervised temporal action localization with semantic-aware mechanisms // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 18878-18887 [DOI: 10.1109/CVPR52729.2023.01810]
Wu W H, Luo H P, Fang B, Wang J D and Ouyang W L. 2023. Cap4Video: what can auxiliary captions do for text-video retrieval? [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2301.00184.pdf
Xiong C X, Guo D and Liu X L. 2020. Temporal proposal optimization for temporal action detection. Journal of Image and Graphics, 25(7): 1447-1458 [DOI: 10.11834/jig.190440]
Xu M M, Zhao C, Rojas D S, Thabet A and Ghanem B. 2020. G-TAD: sub-graph localization for temporal action detection // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10153-10162 [DOI: 10.1109/CVPR42600.2020.01017]
Yang W F, Zhang T Z, Yu X Y, Qi T, Zhang Y D and Wu F. 2021. Uncertainty guided collaborative training for weakly supervised temporal action detection // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 53-63 [DOI: 10.1109/CVPR46437.2021.00012]
Yang Z C, Qin J and Huang D. 2022. ACGNet: action complement graph network for weakly-supervised temporal action localization // Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtually: AAAI: 3090-3098 [DOI: 10.1609/aaai.v36i3.20216]
Zhai Y H, Wang L, Tang W, Zhang Q L, Zheng N N, Doermann D, Yuan J S and Hua G. 2023. Adaptive two-stream consensus network for weakly-supervised temporal action localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4): 4136-4151 [DOI: 10.1109/TPAMI.2022.3189662]
Zhai Y H, Wang L, Tang W, Zhang Q L, Zheng N N and Hua G. 2022. Action coherence network for weakly-supervised temporal action localization. IEEE Transactions on Multimedia, 24: 1857-1870 [DOI: 10.1109/TMM.2021.3073235]
Zhang C, Cao M, Yang D M, Chen J and Zou Y X. 2021. CoLA: weakly-supervised temporal action localization with snippet contrastive learning [EB/OL]. [2024-06-26]. https://arxiv.org/pdf/2103.16392.pdf
Zhou K Y, Yang J K, Loy C C and Liu Z W. 2022. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337-2348 [DOI: 10.1007/s11263-022-01653-1]