Point cloud human behavior recognition based on coordinate transformation and spatiotemporal information injection
2024, Vol. 29, No. 4, Pages 1056-1069
Print publication date: 2024-04-16
DOI: 10.11834/jig.230215
You Kaijun, Hou Zhenjie, Liang Jiuzhen, Zhong Zhuokun, Shi Haiyong. 2024. Point cloud human behavior recognition based on coordinate transformation and spatiotemporal information injection. Journal of Image and Graphics, 29(4): 1056-1069
Objective
Depth map sequences, widely used in action recognition, express the spatiotemporal structure of behavior data insufficiently and are easily affected by factors such as dark-colored objects. Point cloud data provide rich spatial information and geometric features that make up for these shortcomings of depth images, but most point cloud datasets are small in scale and carry no temporal information. To improve the utilization of spatiotemporal structure information, this paper proposes a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection.
Method
Depth map sequences are converted into 3D point cloud sequences, which compensates for the small scale of point cloud datasets and introduces the temporal notion of frames. The proposed network consists of two modules: a feature extraction module and a spatiotemporal information injection module. The feature extraction module extracts deep appearance contour features of the point cloud. The spatiotemporal information injection module injects temporal information into the contour features and then injects spatial structure information through a set of random tensor projections. Finally, multiple features from different levels are aggregated and fed into a classifier.
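To make the coordinate transformation concrete, the following is a minimal sketch of the standard pinhole back-projection from a single depth frame to a point cloud; the intrinsics (fx, fy, cx, cy) and the millimetre depth scale are placeholder assumptions, since the abstract does not specify the exact values used.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project one H x W depth frame (millimetres) into an (N, 3) point cloud.

    Standard pinhole model: fx, fy, cx, cy are depth-camera intrinsics.
    The values used below are placeholders, not ones reported by the paper.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.astype(np.float32) / depth_scale      # depth in metres
    valid = z > 0                                   # drop empty/background pixels
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# One cloud per frame preserves the temporal order of the original sequence:
# clouds = [depth_to_point_cloud(f, fx=365.0, fy=365.0, cx=256.0, cy=212.0)
#           for f in depth_frames]
```

Applying the back-projection frame by frame is what turns a depth dataset into an ordered point cloud sequence, which is the property the network's temporal injection step relies on.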
Result
The method is validated on three public datasets, where the proposed network structure shows good performance. On the NTU RGB+D 60 dataset, its accuracy is 1.3% and 0.2% higher than that of PSTNet (point spatio-temporal network) and SequentialPointNet, respectively, and on the NTU RGB+D 120 dataset it is 1.9% higher than that of PSTNet. To verify the robustness of the model, an experimental comparison on the small MSR Action3D dataset shows a recognition accuracy 1.07% higher than that of SequentialPointNet.
Conclusion
The proposed network captures static point cloud appearance contour features while incorporating dynamic spatiotemporal information, compensating for the spatiotemporal loss caused by downsampling during feature extraction.
Objective
Human motion recognition has become a research hotspot in computer vision because of its extensive applications in video surveillance, virtual reality, and intelligent human-computer interaction. Deep learning has achieved excellent results in feature extraction from static images and has gradually been extended to behavior recognition and related directions. Traditional research on human behavior recognition focuses on depth image sequences built on 2D information. Depth images not only capture 3D information but also provide depth information, which represents the distance between the target and the depth camera within the visual range and disregards the influence of external factors such as lighting and background. Although depth images capture 3D information, most depth image algorithms extract behavior features with multi-view methods, whose effectiveness for spatiotemporal features depends on the angle and number of views; this considerably lowers the utilization rate of 3D structural information, and much of the spatiotemporal structure of the 3D data is lost. With the rapid development of 3D acquisition technology, 3D sensors, including various types of 3D scanners and LiDAR, are becoming increasingly accessible and affordable. The 3D data collected by these sensors provide rich geometry, shape, and scale information and have many applications in fields such as autonomous driving, robotics, remote sensing, and healthcare. The point cloud is a commonly used 3D representation; it retains the original geometric information in 3D space without any discretization and is therefore the preferred representation for scene-understanding applications such as autonomous driving and robotics. However, deep learning on 3D point clouds still faces major challenges, such as the small size of datasets.
Method
In this study, the depth map sequence is first converted into a 3D point cloud sequence to represent human behavior information, and large, authoritative depth datasets are thereby converted into point cloud datasets, compensating for the small size of existing point cloud datasets. Given the huge volume of point cloud data, traditional point cloud deep learning networks sample the point cloud before feature extraction, most commonly by random subsampling, which inevitably destroys part of the structural information. To improve the utilization of temporal and spatial structure information and compensate for the loss incurred during random subsampling, a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection is proposed for motion recognition. The network consists of two modules: a feature extraction module and a spatiotemporal information injection module. The feature extraction module extracts the deep appearance contour features of the point cloud through operations such as an abstraction manipulation layer, a multilayer perceptron, and max pooling; the abstraction manipulation layer comprises sampling, grouping, a convolutional block attention module (CBAM), and a PointNet layer. In the spatiotemporal information injection module, temporal and spatial structure information is injected into the abstract features. For temporal injection, sine and cosine functions of different frequencies serve as temporal position codes, because they assign each vector a unique and robust position even when the input order is disturbed. For spatial structure injection, the position-coded abstract features are multiplied by a group of learnable random tensors drawn from a normal distribution and projected onto the corresponding dimensional space; the coefficients of these tensors are then learned by the network to find the optimal projection space, which better captures the structural relations within the point cloud. Subsequently, the features pass through an inter-point attention module that further learns the structural relationships among the points. Finally, the multilevel features from feature extraction and information injection are aggregated and fed into the classifier for classification.
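As a rough, self-contained illustration of the injection step described above (not the authors' exact implementation), the sketch below adds sinusoidal temporal position codes to per-frame features and then applies a learnable projection tensor initialised from a normal distribution; the class name, tensor sizes, and the use of a single projection matrix are our own simplifications of the abstract's description.

```python
import math
import torch
import torch.nn as nn

class SpatioTemporalInjection(nn.Module):
    """Minimal sketch of spatiotemporal information injection, based only on
    the abstract: sinusoidal temporal position codes are added to frame-level
    abstract features, which are then projected by a learnable random tensor."""

    def __init__(self, feat_dim: int, proj_dim: int, max_frames: int = 64):
        super().__init__()
        assert feat_dim % 2 == 0, "sinusoidal coding assumes an even feature size"
        # Learnable projection tensor, initialised ~ N(0, 1); its coefficients
        # are optimised so the projection space focuses on structural relations.
        self.proj = nn.Parameter(torch.randn(feat_dim, proj_dim))
        # Fixed sine/cosine position table, one row per frame index.
        position = torch.arange(max_frames, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, feat_dim, 2, dtype=torch.float32)
                             * (-math.log(10000.0) / feat_dim))
        pe = torch.zeros(max_frames, feat_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) frame-level appearance features.
        x = x + self.pe[: x.size(1)]   # inject temporal order information
        return x @ self.proj           # inject spatial structure via projection
```

In the full network, these projected features would then pass through the inter-point attention module and be aggregated with the multilevel features before classification.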
Result
A large number of experiments are performed on three common datasets, and the proposed network structure exhibits good performance. Accuracy on the NTU RGB+D 60 dataset is 1.3% and 0.2% higher than that of PSTNet and SequentialPointNet, respectively, considerably exceeding the recognition accuracy of other networks. On the NTU RGB+D 120 dataset, although accuracy is 0.1% lower than that of SequentialPointNet, it remains in a leading position compared with other networks and is 1.9% higher than that of PSTNet. The NTU datasets are among the largest human action datasets. To ensure the robustness of the network model, its effect on small datasets is verified through an experimental comparison on the MSR Action3D dataset, where the recognition accuracy of the proposed network is 1.07% higher than that of SequentialPointNet and considerably higher than those of other networks.
Conclusion
In this study, we propose a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection for behavior recognition. Through coordinate transformation, the depth map sequence is converted into a 3D point cloud sequence to characterize human behavior information, compensating for the shortcomings of insufficient depth, spatial, and geometric information and improving the utilization of spatiotemporal structure information. The proposed network not only obtains static point cloud contour features but also integrates dynamic temporal and spatial information, compensating for the spatiotemporal losses caused by sampling during feature extraction.
Keywords: human behavior recognition; coordinate transformation; point cloud sequence; feature extraction; spatiotemporal information
Cheng K, Zhang Y F, He X Y, Chen W H, Cheng J and Lu H Q. 2020. Skeleton-based action recognition with shift graph convolutional network//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 180-189 [DOI: 10.1109/CVPR42600.2020.00026]
Chi H G, Ha M H, Chi S, Lee S W, Huang Q X and Ramani K. 2022. InfoGCN: representation learning for human skeleton-based action recognition//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 20154-20164 [DOI: 10.1109/CVPR52688.2022.01955]
Fan H H, Yang Y and Kankanhalli M. 2021. Point 4D Transformer networks for spatio-temporal modeling in point cloud videos//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14199-14208 [DOI: 10.1109/CVPR46437.2021.01398]
Fan H H, Yu X, Ding Y H, Yang Y and Kankanhalli M S. 2022. PSTNet: point spatio-temporal convolution on point cloud sequences [EB/OL]. [2023-02-04]. https://arxiv.org/pdf/2205.13713.pdf
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R and Hu S M. 2021a. PCT: point cloud Transformer. Computational Visual Media, 7(2): 187-199 [DOI: 10.1007/s41095-021-0229-5]
Guo Y L, Wang H Y, Hu Q Y, Liu H, Liu L and Bennamoun M. 2021b. Deep learning for 3D point clouds: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12): 4338-4364 [DOI: 10.1109/TPAMI.2020.3005434]
Kläser A, Marszałek M and Schmid C. 2008. A spatio-temporal descriptor based on 3D-gradients//Proceedings of the British Machine Vision Conference. Leeds, UK: BMVC [DOI: 10.5244/C.22.99]
Korban M and Li X. 2020. DDGCN: a dynamic directed graph convolutional network for action recognition//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 761-776 [DOI: 10.1007/978-3-030-58565-5_45]
Li L G, Wang M S, Ni B B, Wang H, Yang J C and Zhang W J. 2021a. 3D human action representation learning via cross-view consistency pursuit//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 4739-4748 [DOI: 10.1109/CVPR46437.2021.00471]
Li M S, Chen S H, Chen X, Zhang Y, Wang Y F and Tian Q. 2021b. Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6): 3316-3333 [DOI: 10.1109/TPAMI.2021.3053765]
Li W Q, Zhang Z Y and Liu Z C. 2010. Action recognition based on a bag of 3D points//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops. San Francisco, USA: IEEE: 9-14 [DOI: 10.1109/CVPRW.2010.5543273]
Li X, Hou Z J, Liang J Z and Chang X Z. 2019. Bi-directional removal of reverse gravitational acceleration based on data segmentation. Journal of Computer-Aided Design and Computer Graphics, 31(4): 560-572 [DOI: 10.3724/SP.J.1089.2019.17344]
Li X, Huang Q, Wang Z J, Hou Z J and Yang T J. 2022. SequentialPointNet: a strong parallelized point cloud sequence classification network for 3D action recognition [EB/OL]. [2023-02-04]. https://arxiv.org/pdf/2111.08492v1.pdf
Li X, Huang Q, Zhang Y F, Yang T J and Wang Z J. 2023. PointMapNet: point cloud feature map network for 3D human action recognition. Symmetry, 15(2): #363 [DOI: 10.3390/sym15020363]
Liu J, Shahroudy A, Perez M, Wang G, Duan L Y and Kot A C. 2020a. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684-2701 [DOI: 10.1109/TPAMI.2019.2916873]
Liu J H, Guo J Y and Xu D. 2022. GeometryMotion-Transformer: an end-to-end framework for 3D action recognition. IEEE Transactions on Multimedia, 25: 5649-5661 [DOI: 10.1109/TMM.2022.3198011]
Liu X Y, Yan M Y and Bohg J. 2019. MeteorNet: deep learning on dynamic 3D point cloud sequences//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 9245-9254 [DOI: 10.1109/ICCV.2019.00934]
Liu Z Y, Zhang H W, Chen Z H, Wang Z Y and Ouyang W L. 2020b. Disentangling and unifying graph convolutions for skeleton-based action recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 140-149 [DOI: 10.1109/CVPR42600.2020.00022]
Qi C R, Su H, Mo K C and Guibas L J. 2017a. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Qi C R, Yi L, Su H and Guibas L J. 2017b. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Sánchez-Caballero A, de López-Diz S, Fuentes-Jimenez D, Losada-Gutiérrez C, Marrón-Romera M, Casillas-Pérez D and Sarker M I. 2022. 3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information. Multimedia Tools and Applications, 81(17): 24119-24143 [DOI: 10.1007/s11042-022-12091-z]
Sánchez-Caballero A, Fuentes-Jimenez D and Losada-Gutiérrez C. 2020. Exploiting the ConvLSTM: human action recognition using raw depth video-based recurrent neural networks [EB/OL]. [2023-02-04]. http://arxiv.org/pdf/2006.07744.pdf
Shahroudy A, Liu J, Ng T T and Wang G. 2016. NTU RGB+D: a large scale dataset for 3D human activity analysis//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1010-1019 [DOI: 10.1109/CVPR.2016.115]
Shi H Y, Hou Z J, Chao X and Zhong Z K. 2023. Multimodal spatial-temporal feature representation and its application in action recognition. Journal of Image and Graphics, 28(4): 1041-1055 [DOI: 10.11834/jig.211217]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019. Skeleton-based action recognition with directed graph neural networks//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 7904-7913 [DOI: 10.1109/CVPR.2019.00810]
Song Y F, Zhang Z, Shan C F and Wang L. 2022a. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2): 1474-1488 [DOI: 10.1109/TPAMI.2022.3157033]
Song Y P, He F Z, Duan Y S, Si T Z and Bai J W. 2022b. LSLPCT: an enhanced local semantic learning Transformer for 3-D point cloud analysis. IEEE Transactions on Geoscience and Remote Sensing, 60: #4708813 [DOI: 10.1109/TGRS.2022.3202823]
Tao S B, Liang C, Jiang T P, Yang Y J and Wang Y J. 2021. Sparse voxel pyramid neighborhood construction and classification of LiDAR point cloud. Journal of Image and Graphics, 26(11): 2703-2712 [DOI: 10.11834/jig.200262]
Vieira A W, Nascimento E R, Oliveira G L, Liu Z C and Campos M F M. 2012. STOP: space-time occupancy patterns for 3D action recognition from depth map sequences//Proceedings of the 17th Iberoamerican Congress. Buenos Aires, Argentina: Springer: 252-259 [DOI: 10.1007/978-3-642-33275-3_31]
Wang J, Liu Z C, Wu Y and Yuan J S. 2012. Mining actionlet ensemble for action recognition with depth cameras//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1290-1297 [DOI: 10.1109/CVPR.2012.6247813]
Wang Y C, Xiao Y, Xiong F, Jiang W X, Cao Z G, Zhou J T and Yuan J S. 2020. 3DV: 3D dynamic voxel for action recognition in depth video//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 508-517 [DOI: 10.1109/CVPR42600.2020.00059]
Xiang W M, Li C, Zhou Y X, Wang B and Zhang L. 2023. Language action description prompts for skeleton-based action recognition [EB/OL]. [2023-09-06]. http://arxiv.org/pdf/2208.05318.pdf
Xiao Y, Chen J, Wang Y C, Cao Z G, Zhou J T and Bai X. 2019. Action recognition for depth video using multi-view dynamic images. Information Sciences, 480: 287-304 [DOI: 10.1016/j.ins.2018.12.050]
Xu M T, Ding R Y, Zhao H S and Qi X J. 2021. PAConv: position adaptive convolution with dynamic kernel assembling on point clouds//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 3172-3181 [DOI: 10.1109/CVPR46437.2021.00319]
Xu Y, Hou Z J, Liang J Z, Chen C, Jia L and Song Y. 2018. Action recognition using weighted fusion of depth images and skeleton's key frames. Journal of Computer-Aided Design and Computer Graphics, 30(7): 1313-1320 [DOI: 10.3724/SP.J.1089.2018.16771]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence and the 30th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI: 7444-7452