标记分布与时空注意力感知的视频动作质量评估
Label distribution learning and spatio-temporal attentional awareness for video action quality assessment
2023年28卷第12期 页码:3810-3824
纸质出版日期: 2023-12-16
DOI: 10.11834/jig.221074
张宇, 徐天宇, 米思娅. 2023. 标记分布与时空注意力感知的视频动作质量评估. 中国图象图形学报, 28(12):3810-3824
Zhang Yu, Xu Tianyu, Mi Siya. 2023. Label distribution learning and spatio-temporal attentional awareness for video action quality assessment. Journal of Image and Graphics, 28(12):3810-3824
目的
视频动作质量评估旨在评估视频中特定动作的执行情况和完成质量。自动化的动作质量评估能够有效地减少人力资源的损耗,可以更加精准、公正地对视频内容进行评估。传统动作质量评估方法主要存在以下问题:1)视频中动作主体的多尺度时空特征问题;2)认知差异导致的标记内在模糊性问题;3)多头自注意力机制的注意力头冗余问题。针对以上问题,提出了一种能够感知视频序列中不同时空位置、生成细粒度标记的动作质量评估模型SALDL(self-attention and label distribution learning)。
方法
SALDL提出Attention-Inc(attention-inception)结构,该结构通过Embedding、多头自注意力以及多层感知机将自注意力机制渐进式融入Inception结构,使模型能够获得不同尺度卷积特征之间的上下文信息。提出一种正负时间注意力模块PNTA(pos-neg temporal attention),通过PNTA损失挖掘时间注意力特征,从而减少自注意力头冗余并提取不同片段的注意力特征。SALDL模型通过标记增强及标记分布学习生成细粒度的动作质量标记。
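Below is a minimal PyTorch sketch of the positive/negative temporal attention idea described above, assuming one positive head drawn toward action-relevant clips and one negative head pushed toward the remaining clips; the class name `PosNegTemporalAttention` and the overlap-based auxiliary penalty are illustrative assumptions, not the paper's exact PNTA loss.

```python
# A sketch of a positive/negative temporal attention head in the spirit of
# PNTA; the single pos/neg head pair and the overlap penalty are assumptions.
import torch
import torch.nn as nn

class PosNegTemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # one scoring vector per head: the positive head should attend to
        # action-relevant clips, the negative head to irrelevant ones
        self.pos_head = nn.Linear(dim, 1)
        self.neg_head = nn.Linear(dim, 1)

    def forward(self, clip_feats):                         # (B, T, C)
        w_pos = self.pos_head(clip_feats).softmax(dim=1)   # (B, T, 1)
        w_neg = self.neg_head(clip_feats).softmax(dim=1)   # (B, T, 1)
        pooled = (w_pos * clip_feats).sum(dim=1)           # (B, C)
        # illustrative PNTA-style penalty: low overlap forces the two heads
        # to attend to different time segments
        overlap = (w_pos * w_neg).sum(dim=1).mean()
        return pooled, overlap

# usage: pooled features feed the scoring head; overlap joins the task loss
x = torch.randn(2, 16, 512)                  # 2 videos, 16 clip features
feats, pnta_penalty = PosNegTemporalAttention(512)(x)
```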
结果
提出的SALDL模型在MTL-AQA(multitask learning-action quality assessment)和JIGSAWS(JHU-ISI gesture and skill assessment working set)等数据集上进行了大量对比及消融实验,斯皮尔曼等级相关系数分别为0.941 6和0.818 3。
结论
SALDL模型通过充分挖掘不同尺度的时空特征解决了多尺度时空特征问题,通过正负时间注意力模块缓解了注意力头的冗余问题,并引入符合标记分布的先验知识进行标记增强,从而解决了标记的内在模糊性问题。
Objective
Video action quality assessment aims to evaluate the execution and completion quality of specific actions in a video. Automated action quality assessment can effectively reduce human labor costs and produce more accurate and fair evaluations of video content. However, traditional video action quality assessment methods mainly suffer from three problems. First, most of these methods struggle with the multi-scale spatial and temporal features of the action subject. Specifically, the spatial and temporal location of the action in a video is critical for action quality assessment, and a sample video contains much information unrelated to the action. Current methods thus encounter multi-scale spatial feature issues, in which different videos may contain subjects of varying scales in the spatial dimension, making action information difficult to capture. They also confront multi-scale temporal feature issues, in which actions may differ in duration and execution rate in the temporal dimension and the correlations between various time segments and the label differ. Second, existing methods ignore the inherent ambiguity of labels caused by cognitive differences. These methods tend to focus on a single score label and ignore the inherent ambiguity of score labels, the possibility that different judges provide different scores, and the subjectivity behind the given scores. For example, diving scores are given by seven different judges rather than being determined by a single label. Third, current attention mechanisms face redundancy in their self-attention heads. Previous studies have employed many self-attention heads, but these heads become redundant during training, and removing the majority of them does not significantly affect model performance. Experiments show that simply increasing the number of heads only degrades action quality assessment performance. To address these problems, this paper proposes self-attention and label distribution learning (SALDL), an action quality assessment model that attends to different spatio-temporal locations in video sequences and generates fine-grained labels.
Method
This paper designs a new video action quality assessment model called SALDL that focuses on action information at different spatio-temporal locations in video sequences and generates fine-grained labels via label distribution learning, thus effectively addressing label ambiguity. SALDL comprises three main parts, namely, the video representation, pos-neg temporal attention (PNTA), and label distribution learning (LDL) modules. In the video representation module, SALDL employs an inflated 3D ConvNet (I3D) backbone with multi-receptive-field convolution kernels to extract the spatial features within video clips. We also propose an Attention-Inc module that utilizes embedding, multi-head self-attention (MHSA), and a multi-layer perceptron (MLP) to progressively incorporate the self-attention mechanism into the Inception module, enabling the model to obtain contextual information between convolutional features at different scales. In the PNTA module, a temporal attention module with positive and negative attention heads is used to fully exploit temporal attention features through the PNTA loss, thereby reducing the redundancy of self-attention heads and extracting attention features from different time segments. In the LDL module, SALDL uses label distribution learning to generate fine-grained action quality labels, thereby resolving the inherent ambiguity of the labels. We introduce the prior knowledge that score labels follow a certain distribution and apply label enhancement to convert single labels into label distributions. The predicted label distribution is then fitted to the ground-truth label distribution via a Kullback-Leibler divergence loss.
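As a concrete illustration of the LDL module, the following sketch expands a single score into a discrete Gaussian label distribution (the assumed prior) and fits the prediction with a Kullback-Leibler divergence loss; the helper `score_to_distribution`, the bin range, and the `sigma` value are hypothetical choices, not taken from the paper.

```python
# A sketch of label enhancement plus KL-divergence fitting, assuming a
# discrete Gaussian prior over score bins; bin range and sigma are made up.
import torch
import torch.nn.functional as F

def score_to_distribution(score, bins, sigma=2.0):
    """Label enhancement: expand a single score label into a discrete
    Gaussian distribution over the score bins."""
    dist = torch.exp(-(bins - score) ** 2 / (2 * sigma ** 2))
    return dist / dist.sum()

def ldl_loss(pred_logits, target_dist):
    """KL divergence between predicted and enhanced label distributions."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

bins = torch.arange(0.0, 104.5, 0.5)          # hypothetical diving score bins
target = score_to_distribution(torch.tensor(85.0), bins)
pred_logits = torch.randn(1, bins.numel())    # stand-in for the model head
loss = ldl_loss(pred_logits, target.unsqueeze(0))
# the final predicted score is the probability-weighted sum over the bins
pred_score = (F.softmax(pred_logits, dim=-1) * bins).sum(dim=-1)
```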
Result
Extensive comparison experiments were performed on the multitask learning-action quality assessment (MTL-AQA) and JHU-ISI gesture and skill assessment working set (JIGSAWS) datasets. The Spearman rank correlation coefficient (Sp.Corr) was 0.941 6 on the MTL-AQA dataset, and the three JIGSAWS tasks reached 0.836 4, 0.866 0, and 0.753 1, all of which are state-of-the-art results. Extensive ablation experiments were also performed on the PNTA, LDL, and Attention-Inc structures of the SALDL model. In a regression-based variant of SALDL, the output dimension of the fully connected layer was changed to 1 and the softmax function was removed so that the model directly generated a prediction score; this variant obtained an Sp.Corr of 0.932 0. SALDL-w/o PNTA, the SALDL model without the PNTA module, obtained an Sp.Corr of 0.938 4, while SALDL-w/o Attention-Inc, the SALDL model without the Attention-Inc structure, obtained an Sp.Corr of 0.939 9. These results highlight the contribution of each module to SALDL. We also conducted ablation experiments on the choice of segmentation strategy and distribution function. Results show that this choice should be made dynamically according to the dataset type. Therefore, future research should explore the ideal distribution function, the fusion of different distribution functions, and other methods for adaptive label enhancement.
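For reference, the Spearman rank correlation (Sp.Corr) reported above is a standard rank statistic that can be computed with scipy; the scores below are made-up stand-ins, not values from the paper.

```python
# Spearman rank correlation between predicted and ground-truth scores.
from scipy import stats

pred_scores = [85.1, 62.3, 91.0, 47.5]   # hypothetical model predictions
true_scores = [86.0, 60.0, 92.5, 50.0]   # hypothetical judge scores
rho, _ = stats.spearmanr(pred_scores, true_scores)
print(f"Sp.Corr = {rho:.4f}")
```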
Conclusion
The proposed SALDL model addresses the multi-scale spatio-temporal feature problem by fully mining spatio-temporal features at different scales. It mitigates the redundancy of self-attention heads through its positive and negative temporal attention heads, and it resolves the intrinsic ambiguity of labels by introducing the prior knowledge that labels conform to a certain distribution for label enhancement and by performing label distribution learning. The proposed SALDL model achieves state-of-the-art performance on several action quality assessment datasets, fully validating its effectiveness.
动作质量评估(AQA);Inception模块;自注意力机制;标记分布学习;斯皮尔曼等级相关系数
action quality assessment (AQA); Inception module; self-attention mechanism; label distribution learning; Spearman rank correlation coefficient