Complementary multi-type prompts for weakly-supervised temporal action localization
2024, pp. 1-12
Online publication date: 2024-09-10
DOI: 10.11834/jig.240354
Ren Xiaolong, Zhang Feifei, Zhou Wanting, et al. Complementary multi-type prompts for weakly-supervised temporal action localization[J]. Journal of Image and Graphics,
Objective
Weakly-supervised temporal action localization uses only video-level labels to locate the start and end times of action instances and to recognize their categories. Current visual language-based methods use textual prompt information to improve the performance of temporal action localization models. In visual language models, action label text is usually encapsulated as textual prompts, which can be divided into handcrafted prompts and learnable prompts. Existing methods, however, ignore the complementarity between the two types, so the introduced textual prompts cannot fully exert their guiding role. To address this, this paper proposes a multi-type prompts complementary model for weakly-supervised temporal action localization.
Method
First, a prompt interaction module is designed in which the different types of textual prompts interact with the video separately and are weighted by attention, yielding feature information at different scales. Second, to model the correspondence between text and video, a segment-level contrastive loss is used to constrain the matching between textual prompts and action segments. Finally, a threshold filtering module is designed to filter and compare the scores of multiple class activation sequences (CAS), which enhances the discriminability of action categories.
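The paper's exact form of this loss is not reproduced here; the following is a minimal PyTorch-style sketch of a segment-level contrastive loss between action-segment features and class-prompt text features, where the tensor shapes, the temperature value and the function name are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def segment_level_contrastive_loss(seg_feats, text_feats, seg_labels, temperature=0.07):
    """Minimal sketch (not the paper's implementation).

    seg_feats:  (N, D) float tensor, features of candidate action segments
    text_feats: (C, D) float tensor, one text-prompt feature per action class
    seg_labels: (N,)   long tensor, pseudo class label assigned to each segment
    """
    seg_feats = F.normalize(seg_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Similarity between every segment and every class prompt, scaled by a temperature.
    logits = seg_feats @ text_feats.t() / temperature   # (N, C)
    # The prompt of a segment's own class acts as the positive; all others are negatives.
    return F.cross_entropy(logits, seg_labels)
```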
Result
The method is compared with related work on three representative datasets: THUMOS-14, ActivityNet-1.2 and ActivityNet-1.3. It achieves an average mean average precision (mAP) of 39.1% on THUMOS-14 (mAP(0.1:0.7)) and 27.3% on ActivityNet-1.2 (mAP(0.5:0.95)), improvements of 1.1% and 1% over the 2023 method P-MIL (proposal-based multiple instance learning), respectively. On ActivityNet-1.3, it achieves performance comparable to concurrent work, with an average mAP(0.5:0.95) of 26.7%.
Conclusion
The proposed temporal action localization model exploits the complementarity of the two types of textual prompts to guide localization, and the proposed threshold filtering module makes the most of the advantages of both prompt types, maximizing their auxiliary effect and yielding more accurate localization results.
Objective
Weakly supervised temporal action localization uses only video-level annotations to locate the start and end times of action instances and identify their categories. Because only video-level annotations are available under weak supervision, a loss function cannot be designed directly for the localization task, so existing work generally adopts a "localization by classification" strategy and trains with multiple-instance learning. This process has two limitations: (1) localization and classification are two different tasks with an obvious gap between them, so deriving localization from classification results may hurt the final performance; (2) under weak supervision there is no fine-grained supervisory information to effectively distinguish actions from background in videos, which makes localization very challenging. Recently, visual language models have received extensive attention. These models aim to capture the correspondence between images and texts for more comprehensive visual perception, and, to better apply such large models to downstream tasks, well-designed textual prompts can improve their performance and robustness. Current visual language-based approaches therefore use auxiliary textual prompt information to compensate for the missing supervision and to improve the performance and robustness of temporal action localization models. In visual language models, action label text is usually encapsulated as textual prompts, which can be categorized into handcrafted prompts and learnable prompts. Handcrafted prompts combine a fixed template with an action label, e.g., "a video of {class}"; they capture general knowledge of the action class but lack knowledge specific to the relevant action. Learnable prompts consist of a set of learnable vectors that are adjusted and optimized during training, so they can capture more specific knowledge. The two types of textual prompts can therefore complement each other to improve the ability to discriminate between categories. However, existing methods ignore this complementarity, so the introduced textual prompts cannot fully exert their guiding role and may even introduce noise, leading to inaccurate action boundaries. This paper therefore proposes a multi-type prompts complementary model for weakly-supervised temporal action localization, which maximizes the guiding effect of textual prompts to improve the accuracy of action localization.
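To make the two prompt types concrete, the sketch below shows one common way they are constructed in CLIP/CoOp-style pipelines; the module name, context length and embedding dimension are hypothetical, and the paper's own construction may differ.

```python
import torch
import torch.nn as nn

class PromptBuilder(nn.Module):
    """Illustrative sketch of handcrafted vs. learnable prompts (hypothetical names)."""

    def __init__(self, class_names, ctx_len=16, embed_dim=512):
        super().__init__()
        # Handcrafted prompts: a fixed template filled with each action label.
        self.handcrafted = [f"a video of {c}" for c in class_names]
        # Learnable prompts: shared context vectors optimized during training,
        # later concatenated with the token embeddings of each class name.
        self.learnable_ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)

    def forward(self, class_name_embeds):
        # class_name_embeds: (C, L_name, D) token embeddings of the class names.
        num_classes = class_name_embeds.size(0)
        ctx = self.learnable_ctx.unsqueeze(0).expand(num_classes, -1, -1)  # (C, ctx_len, D)
        learnable_prompts = torch.cat([ctx, class_name_embeds], dim=1)     # (C, ctx_len + L_name, D)
        return self.handcrafted, learnable_prompts
```

In such a setup the handcrafted strings would be tokenized and encoded by a frozen text encoder, while the learnable context vectors are the only prompt parameters updated by back-propagation.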
Method
First, motivated by the complementarity of textual prompts, this paper designs a prompt interaction module that matches handcrafted prompts and learnable prompts with action segments to obtain separate similarity matrices, then mines the intrinsic connection between textual features and segment features through an attention layer and fuses them, thereby realizing the interaction between the different types of textual prompts. Second, the textual prompts must be well matched with the action segments to play their guiding role, so a segment-level contrastive loss is introduced to constrain the matching between prompts and action segments. Finally, a threshold filtering module is designed that filters the class activation sequences (CAS) obtained under the guidance of the different types of textual prompts according to a threshold. The final CAS is then obtained by summing the CAS produced by multi-layer filtering, weighted by a scale parameter, with the CAS computed from video-level features alone; it covers the parts of each sequence with higher confidence scores and thus realizes the complementary advantages of the different types of textual prompts.
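As a rough illustration of the threshold-filtering idea, the sketch below fuses a video-only CAS with two prompt-guided CAS; the single threshold, the element-wise maximum and the blending weight are simplifying assumptions rather than the paper's multi-layer nested scheme.

```python
import torch

def fuse_cas(cas_video, cas_hand, cas_learn, thresh=0.5, alpha=0.5):
    """Hedged sketch of threshold-based CAS fusion (hypothetical parameters).

    Each CAS has shape (T, C): per-snippet activation scores for C action classes.
    """
    # Keep prompt-guided scores only where they exceed the confidence threshold.
    cas_hand = cas_hand * (cas_hand > thresh).float()
    cas_learn = cas_learn * (cas_learn > thresh).float()
    # Retain the higher-confidence score of the two prompt-guided branches.
    cas_prompt = torch.maximum(cas_hand, cas_learn)
    # Blend with the CAS computed from video-level features alone.
    return alpha * cas_prompt + (1 - alpha) * cas_video
```

The paper applies the filtering in multiple nested layers with a scale parameter; the single-pass version above only conveys the idea of keeping the higher-confidence parts of each sequence.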
Result
Extensive experiments on three representative datasets, THUMOS-14, ActivityNet-1.2 and ActivityNet-1.3, validate the effectiveness of the proposed method. On THUMOS-14, it achieves average performance of 58.2%, 39.1% and 47.9% for mAP(0.1:0.5), mAP(0.1:0.7) and mAP(0.3:0.7), respectively, exceeding the 2023 method P-MIL (proposal-based multiple instance learning) by up to 1.1%. On ActivityNet-1.2, the proposed method reaches 27.3% mAP(0.5:0.95), an average improvement of nearly 1% over the state-of-the-art method. On ActivityNet-1.3, it achieves performance comparable to concurrent work, with an average mAP(0.5:0.95) of 26.7%. In addition, extensive ablation experiments on THUMOS-14 demonstrate the effectiveness of each module.
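For readers unfamiliar with the notation, mAP(0.1:0.7) is simply the mean of the per-threshold mAP values over IoU thresholds 0.1, 0.2, ..., 0.7; the snippet below illustrates the averaging with purely hypothetical per-threshold values, not the paper's results.

```python
import numpy as np

# Hypothetical per-IoU-threshold mAP values, for illustration only.
map_per_iou = {0.1: 0.70, 0.2: 0.64, 0.3: 0.56, 0.4: 0.47,
               0.5: 0.37, 0.6: 0.26, 0.7: 0.14}

# mAP(0.1:0.7) = mean over IoU thresholds from 0.1 to 0.7 in steps of 0.1.
avg_map = np.mean([map_per_iou[round(t, 1)] for t in np.arange(0.1, 0.75, 0.1)])
print(f"average mAP(0.1:0.7) = {avg_map:.3f}")
```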
Conclusion
In this paper, we propose a new weakly supervised temporal action localization model based on the complementarity of multiple types of prompts, which alleviates inaccurate localization boundaries by leveraging the complementarity between different types of textual prompts. A prompt interaction module is designed to realize the interaction between the different types of textual prompts. In addition, a segment-level contrastive loss is introduced to model the correspondence between textual prompts and video, thereby better constraining the matching between the introduced prompts and action segments. Finally, a multi-layer nested threshold filtering module is designed, which makes full use of the advantages of the two types of textual prompts, avoids the interference of noisy information, and maximizes the auxiliary role of the textual information. Extensive experiments on the three challenging datasets demonstrate the effectiveness of the proposed method.
weakly supervised temporal action localization; visual language model; handcrafted prompts; learnable prompts; class activation sequence
Cheng F and Bertasius G. 2022. TALLFormer: Temporal Action Localization with Long-memory Transformer [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/2204.01680.pdf
Chen M, Gao J and Yang S. 2022. Dual-evidential learning for weakly supervised temporal action localization. Glasgow, UK: Springer: 192-208 [DOI: 10.1007/978-3-031-19772-7_12]
Carreira J and Zisserman A. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Honolulu, USA: IEEE: 6299-6308 [DOI: 10.1109/CVPR.2017.502]
Duval V and Aujol J F. 2017. The TVL1 model: a geometric point of view [EB/OL]. [2023-09-24]. https://doi.org/10.1137/090757083
Kingma D P and Ba J. 2014. Adam: A method for stochastic optimization [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/1412.6980.pdf
Gao J, Chen M and Xu C. 2022. Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/2203.16800.pdf
Gaidon A, Harchaoui Z and Schmid C. 2013. Temporal Localization of Actions with Actoms. IEEE: 2782-2795 [DOI: 10.1109/TPAMI.2013.65]
He D, Zhao X and Huang J. 2019. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Honolulu, USA: AAAI: 8393-8400 [DOI: 10.1609/aaai.v33i01.3301839]
Hong F T, Feng J C and Xu D. 2021. Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/2107.12589.pdf
Heilbron F C, Escorcia V and Ghanem B. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. Boston, USA: IEEE: 961-970 [DOI: 10.1109/CVPR.2015.7298698]
He B, Yang X and Kang L. 2022. ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/2203.15187.pdf
Huang L, Wang L and Li H. 2022. Weakly supervised temporal action localization via representative snippet knowledge propagation [EB/OL]. [2023-09-24]. https://arxiv.org/pdf/2203.02925.pdf
Islam A, Long C and Radke R J. 2021. A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization. New York, USA: AAAI: 1637-1645 [DOI: 10.1609/aaai.v35i2.16256]
Idrees H, Zamir A R and Jiang Y G. 2017. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155: 1-23