Combining dilated convolution and multiscale fusion for real-time spatio-temporal action detection
2025, Vol. 30, No. 2, Pages: 406-420
Print publication date: 2025-02-16
DOI: 10.11834/jig.240098
Cheng Yong, Gao Yuanyuan, Wang Jun, Yang Ling, Xu Xiaolong, Cheng Yao, Zhang Kaihua. 2025. Combining dilated convolution and multiscale fusion for real-time spatio-temporal action detection. Journal of Image and Graphics, 30(02): 0406-0420 [DOI: 10.11834/jig.240098]
Objective
Spatio-temporal action detection aims to predict the spatio-temporal locations and corresponding categories of all actions in a video clip. However, most existing methods focus on the visual and motion features of the actors while ignoring the global contextual information of the actors' interactions. To address these shortcomings, an efficient spatio-temporal action detection model combining dilated convolution and multiscale fusion, called the efficient action detector (EAD), is proposed.
Method
First, a lightweight dual-branch network is used to simultaneously model the static information of the key frame and the dynamic spatio-temporal information of the video clip. Second, a lightweight spatial dilated augmented module built on the grouping idea extracts global contextual information. Then, a multiscale feature fusion unit composed of several DO-Conv structures captures and fuses multiscale features. Finally, the features at different levels are fed into separate prediction heads for detection.
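To make this four-step pipeline concrete, the following is a minimal PyTorch-style sketch of a dual-branch detector of this kind. It is an illustrative assumption, not the authors' released implementation: the class name DualBranchDetector and the submodule slots (backbone_2d, backbone_3d, enhance, fuse, heads) are hypothetical placeholders for the 2D keyframe branch, the 3D clip branch, the context enhancement module, the fusion unit, and the per-level prediction heads.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchDetector(nn.Module):
    # The 2D branch models the key frame, the 3D branch models the clip, an
    # enhancement module adds global context, and a fusion unit feeds the
    # per-level prediction heads.
    def __init__(self, backbone_2d, backbone_3d, enhance, fuse, heads):
        super().__init__()
        self.backbone_2d = backbone_2d     # keyframe branch (static appearance)
        self.backbone_3d = backbone_3d     # clip branch (dynamic spatio-temporal cues)
        self.enhance = enhance             # e.g., a dilated-convolution context module
        self.fuse = fuse                   # e.g., a multiscale feature fusion unit
        self.heads = nn.ModuleList(heads)  # one prediction head per feature level

    def forward(self, clip):
        # clip: (B, C, T, H, W); here the key frame is taken to be the last frame.
        key_frame = clip[:, :, -1]                    # (B, C, H, W)
        feats_2d = self.backbone_2d(key_frame)        # list of per-level 2D features
        feats_3d = self.backbone_3d(clip)             # (B, C3, T', H', W')
        context = self.enhance(feats_3d.mean(dim=2))  # collapse time, enrich context
        outputs = []
        for level, f2d in enumerate(feats_2d):
            # align the clip-level context to each pyramid level, fuse, and predict
            aligned = F.interpolate(context, size=f2d.shape[-2:], mode="nearest")
            outputs.append(self.heads[level](self.fuse(f2d, aligned)))
        return outputs  # per-level classification and localization predictions

In the actual EAD, the 2D branch additionally decouples classification and localization features before fusion; the sketch omits that detail for brevity.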
Result
Experiments were conducted on the UCF101-24 and AVA (atomic visual actions) datasets, and the detection results of EAD were compared with those of existing algorithms. On UCF101-24, the frame-level mean average precision (frame-mAP) and video-level mean average precision (video-mAP) reach 80.93% and 50.41%, respectively, and the missed and false detections of the baseline method are reduced. On AVA, the frame-mAP reaches 15.92% while the computational overhead remains low.
Conclusion
Comparisons with the baseline and current mainstream methods show that EAD models key global information at a low computational cost and improves the accuracy of real-time action detection.
Objective
Spatio-temporal action detection (STAD) is a significant challenge in video understanding. The objective is to localize the actions occurring in a video in space and time and to categorize the related action classes. Most existing methods rely on the backbone network for feature modeling of video clips, which captures only local features and ignores the global contextual information of the actors' interactions; as a result, the model cannot fully comprehend the nuances of the entire scene. The current mainstream methods for real-time STAD are based on dual-stream networks. However, the two branches are typically fused by simple channel-wise concatenation, which leaves significant redundancy in the fused features and semantic gaps between the branch features, and this scheme degrades the accuracy of the model. Here, an efficient STAD model called the efficient action detector (EAD) is proposed to address these shortcomings.
Method
The EAD model consists of three key components: a 2D branch, a 3D branch, and a fusion head. The 2D branch consists of a pretrained 2D backbone network, a feature pyramid, and a decoupled head; the 3D branch consists of a 3D backbone network and an augmentation module; and the fusion head consists of a multiscale feature fusion unit (MSFFU) and a prediction head. First, key frames are extracted from the video clip and fed into the pretrained 2D backbone (YOLOv7) to detect the actors in the scene and obtain spatially decoupled features, namely, classification features and localization features. Video spatio-temporal features are extracted from the clip via a pretrained lightweight video backbone (ShuffleNetv2). Second, the lightweight spatial dilated augmented module (LSDAM) processes the spatio-temporal features with a grouping strategy, which saves computational resources. The LSDAM consists of a dilated module (DM) and a spatial augmented module (SAM). The DM employs dilated convolutions with different dilation rates, which fully aggregate contextual information while reducing the loss of spatial resolution. The SAM focuses on the key information in the global features captured by the DM and enhances the expression of the target features. The LSDAM therefore first sends the spatio-temporal features to the DM to enlarge the receptive field and then to the SAM to extract the key information, yielding global, low-noise contextual information. The enhanced features are then dimensionally aligned with the spatially decoupled features and fed into the MSFFU for feature fusion. The MSFFU refines the feature information through multiscale fusion and reduces the redundancy of the fused features, which enables the model to better understand the whole scene. Specifically, the MSFFU extracts features from the dual-branch features at multiple levels with different DO-Conv structures, integrates the branches via element-wise multiplication or addition, and then filters out irrelevant information with a convolution operation. DO-Conv accelerates the convergence of the network and improves its generalizability, thereby speeding up training. Finally, the features at different levels are fed into an anchor-free prediction head for STAD.
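As a rough illustration of the modules described above, the sketch below builds a grouped dilated context module, a spatial gate that highlights key positions, and a fusion unit that mixes the two branches with element-wise products and sums before a cleanup convolution. The channel counts, dilation rates, kernel sizes, and class names (DilatedContextModule, SpatialGate, MultiscaleFusionUnit) are assumptions that the abstract does not specify, and plain nn.Conv2d layers stand in for DO-Conv; this mirrors the described idea, not the authors' exact layer configuration.

import torch
import torch.nn as nn

class DilatedContextModule(nn.Module):
    """Parallel dilated 3x3 convolutions with different rates over channel groups."""
    def __init__(self, channels, rates=(1, 2, 3, 5)):
        super().__init__()
        assert channels % len(rates) == 0
        group = channels // len(rates)
        self.branches = nn.ModuleList(
            nn.Conv2d(group, group, 3, padding=r, dilation=r) for r in rates
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=1)  # grouping reduces computation
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.proj(out)

class SpatialGate(nn.Module):
    """Highlight key spatial positions in the context-enriched feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.score(x))  # per-position attention weights

class MultiscaleFusionUnit(nn.Module):
    """Fuse keyframe (2D) and clip (3D) features instead of plain concatenation."""
    def __init__(self, channels):
        super().__init__()
        self.q2d = nn.Conv2d(channels, channels, 1)
        self.q3d = nn.Conv2d(channels, channels, 1)
        self.cleanup = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    def forward(self, f2d, f3d):
        mixed = self.q2d(f2d) * self.q3d(f3d) + f2d + f3d  # multiplicative and additive mixing
        return self.cleanup(mixed)                          # filter redundant responses

# Usage example: enhance a clip feature map, then fuse it with a keyframe feature map.
if __name__ == "__main__":
    f3d = torch.randn(1, 64, 28, 28)
    f2d = torch.randn(1, 64, 28, 28)
    context = SpatialGate(64)(DilatedContextModule(64)(f3d))
    fused = MultiscaleFusionUnit(64)(f2d, context)
    print(fused.shape)  # torch.Size([1, 64, 28, 28])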
Result
Comparative detection results between EAD and existing algorithms were analyzed on the public datasets UCF101-24 and atomic visual actions (AVA) version 2.2. For UCF101-24, frame-mAP, video-mAP, frames per second (FPS), GFLOPs, and the number of parameters (Params) are used as evaluation metrics to assess the accuracy and the space-time complexity of the model; for AVA, frame-mAP and GFLOPs are used. Ablation experiments on the EAD model are performed on the UCF101-24 dataset. With the MSFFU added, the frame-mAP is 79.52% and the video-mAP is 49.29%, improvements of 0.41% and 0.14%, respectively, over the baseline. With the LSDAM added, the frame-mAP is 80.96% and the video-mAP is 49.72%, improvements of 1.85% and 0.57%, respectively, over the baseline. The final EAD model achieves a frame-mAP of 80.93%, a video-mAP of 50.41%, an FPS of 53 f/s, 2.77 GFLOPs, and 10.92 M parameters, improvements of 1.82% and 1.26% in frame-mAP and video-mAP, respectively, over the baseline. The missed and false detections of the baseline method are also reduced. In addition, EAD is compared with existing real-time STAD algorithms: its frame-mAP is improved by 0.53% and 0.43%, and its GFLOPs are reduced by 40.93 and 0.13, compared with YOWO and YOWOv2, respectively. On the AVA dataset, the frame-mAP and GFLOPs reach 13.74% and 2.77, respectively, with 16 input frames, and 15.92% and 4.4, respectively, with 32 input frames. Compared with other mainstream methods, EAD uses a lighter backbone network, achieving a lower computational cost while delivering competitive results.
Conclusion
This study proposes an STAD model called EAD, which is based on a two-stream network, to address the missing global contextual information about actors' interactions and the poor characterization of the fused features. Experimental results on the UCF101-24 and AVA datasets, compared with the baseline and current mainstream methods, verify the robustness and effectiveness of the proposed model in STAD tasks. The proposed model can also be applied to intelligent surveillance, autonomous driving, intelligent healthcare, and other fields.
Cao J M, Li Y Y, Sun M C, Chen Y, Lischinski D, Cohen-Or D, Chen B Q and Tu C H. 2022. DO-Conv: depthwise over-parameterized convolutional layer. IEEE Transactions on Image Processing, 31: 3726-3736 [DOI: 10.1109/TIP.2022.3175432]
Chen S F, Sun P Z, Xie E Z, Ge C J, Wu J N, Ma L, Shen J J and Luo P. 2021. Watch only once: an end-to-end video action detection framework // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 8158-8167 [DOI: 10.1109/ICCV48922.2021.00807]
Cheng Y, Wang W, Ren Z P, Zhao Y F, Liao Y L, Ge Y, Wang J, He J X, Gu Y K, Wang Y X, Zhang W J and Zhang C. 2023. Multi-scale feature fusion and Transformer network for urban green space segmentation from high-resolution remote sensing images. International Journal of Applied Earth Observation and Geoinformation, 124: #103514 [DOI: 10.1016/j.jag.2023.103514]
Dai Y M, Gieseke F, Oehmcke S, Wu Y Q and Barnard K. 2021. Attentional feature fusion // Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3559-3568 [DOI: 10.1109/WACV48630.2021.00360]
Feichtenhofer C. 2020. X3D: expanding architectures for efficient video recognition // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 200-210 [DOI: 10.1109/CVPR42600.2020.00028]
Feichtenhofer C, Fan H Q, Malik J and He K M. 2019. SlowFast networks for video recognition // Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6201-6210 [DOI: 10.1109/ICCV.2019.00630]
Gan C, Wang N Y, Yang Y, Yeung D Y and Hauptmann A G. 2015. DevNet: a deep event network for multimedia event detection and evidence recounting // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 2568-2577 [DOI: 10.1109/CVPR.2015.7298872]
Gevorgyan Z. 2022. SIoU loss: more powerful learning for bounding box regression [EB/OL]. [2024-02-22]. https://arxiv.org/pdf/2205.12740.pdf
Gkioxari G and Malik J. 2015. Finding action tubes // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 759-768 [DOI: 10.1109/CVPR.2015.7298676]
Gu C H, Sun C, Ross D A, Vondrick C, Pantofaru C, Li Y Q, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R, Schmid C and Malik J. 2018. AVA: a video dataset of spatio-temporally localized atomic visual actions // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6047-6056 [DOI: 10.1109/CVPR.2018.00633]
He C J, Liu Q Y and Wang Z L. 2023. Modeling interaction and profiling dependency-relevant video action detection. Journal of Image and Graphics, 28(5): 1499-1512 [DOI: 10.11834/jig.211040]
Hou R, Chen C and Shah M. 2017. Tube convolutional neural network (T-CNN) for action detection in videos // Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5823-5832 [DOI: 10.1109/ICCV.2017.620]
Kalogeiton V, Weinzaepfel P, Ferrari V and Schmid C. 2017. Action tubelet detector for spatio-temporal action localization // Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4415-4423 [DOI: 10.1109/ICCV.2017.472]
Köpüklü O, Wei X Y and Rigoll G. 2019. You only watch once: a unified CNN architecture for real-time spatiotemporal action localization [EB/OL]. [2024-02-22]. https://arxiv.org/pdf/1911.06644.pdf
Li Y X, Wang Z X, Wang L M and Wu G S. 2020. Actions as moving points // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 68-84 [DOI: 10.1007/978-3-030-58517-4_5]
Ma N N, Zhang X Y, Zheng H T and Sun J. 2018. ShuffleNet V2: practical guidelines for efficient CNN architecture design // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 122-138 [DOI: 10.1007/978-3-030-01264-9_8]
Ma X R, Luo Z G, Zhang X, Liao Q, Shen X Y and Wang M Z. 2021. Spatio-temporal action detector with self-attention // Proceedings of 2021 International Joint Conference on Neural Networks. Shenzhen, China: IEEE: 1-8 [DOI: 10.1109/IJCNN52387.2021.9533300]
Pan J T, Chen S Y, Shou M Z, Liu Y, Shao J and Li H S. 2021. Actor-context-actor relation network for spatio-temporal action localization // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 464-474 [DOI: 10.1109/CVPR46437.2021.00053]
Song L, Zhang S W, Yu G and Sun H B. 2019. TACNet: transition-aware context network for spatio-temporal action detection // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 11979-11987 [DOI: 10.1109/CVPR.2019.01226]
Soomro K, Zamir A R and Shah M. 2012. UCF101: a dataset of 101 human actions classes from videos in the wild [EB/OL]. [2024-02-22]. https://arxiv.org/pdf/1212.0402.pdf
Sui L, Zhang C L, Gu L X and Han F. 2023. A simple and efficient pipeline to build an end-to-end spatial-temporal action detector // Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 5988-5997 [DOI: 10.1109/WACV56688.2023.00594]
Sun C, Shrivastava A, Vondrick C, Murphy K, Sukthankar R and Schmid C. 2018. Actor-centric relation network // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 335-351 [DOI: 10.1007/978-3-030-01252-6_20]
Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T and Saenko K. 2015. Sequence to sequence-video to text // Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4534-4542 [DOI: 10.1109/ICCV.2015.515]
Wang D Q and Zhao X. 2022. Class-aware network with global temporal relations for video action detection. Journal of Image and Graphics, 27(12): 3566-3580 [DOI: 10.11834/jig.211096]
Wang L M, Qiao Y, Tang X O and Van Gool L. 2016. Actionness estimation using hybrid fully convolutional networks // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2708-2717 [DOI: 10.1109/CVPR.2016.296]
Wang P Q, Chen P F, Yuan Y, Liu D, Huang Z H, Hou X D and Cottrell G. 2018. Understanding convolution for semantic segmentation // Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe, USA: IEEE: 1451-1460 [DOI: 10.1109/WACV.2018.00163]
Wu T, Cao M Q, Gao Z T, Wu G S and Wang L M. 2023. STMixer: a one-stage sparse action detector // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 14720-14729 [DOI: 10.1109/CVPR52729.2023.01414]
Yang J H and Dai K. 2023. YOWOv2: a stronger yet efficient multi-level detection framework for real-time spatio-temporal action detection [EB/OL]. [2024-02-22]. http://arxiv.org/pdf/2302.06848.pdf
Zhao J J, Zhang Y Y, Li X Y, Chen H, Shuai B, Xu M Z, Liu C H, Kundu K, Xiong Y J, Modolo D, Marsic I, Snoek C G M and Tighe J. 2022. TubeR: tubelet Transformer for video action detection // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13588-13597 [DOI: 10.1109/CVPR52688.2022.01323]