融合多关节特征的单目视觉三维人体姿态估计
A Monocular Vision 3D Human Posture Estimation Network Based on Multi-joint Feature Fusion
2025年,页码:1-15
收稿日期:2024-11-04
修回日期:2025-02-07
录用日期:2025-02-25
网络出版日期:2025-02-26
DOI: 10.11834/jig.240672
目的
针对目前三维人体姿态估计方法未能有效地处理时间序列冗余,难以捕获人体关节上的微小变化的问题,本文提出一种融合多关节特征的单目视觉三维人体姿态估计网络。
方法
在关节运动特征提取模块中,采用多分支操作提取关节在时间维度上的运动特征,并将不同特征融合形成具有高度表达力的特征表示。关节特征融合模块整合不同关节组和中间帧的全局信息,通过矩阵内积表达不同关节组在高维空间中的相对位置及相互联系,得到中间帧3D姿态的初估值。关节约束模块引入中间帧的2D关节点空间位置关系作为隐式约束,与中间帧3D姿态初估值融合,减少不合理的姿态输出,提高最终3D姿态估计的准确性。
结果
实验结果表明,与MHFormer方法相比,本方法在Human3.6M数据集上的平均关节位置误差(mean per joint position error,MPJPE)为29.0 mm,误差降低4.9%;在SittingDown和WalkDog等复杂动作上,误差分别降低7.7%和8.2%。在MPI-INF-3DHP数据集上,MPJPE指标降低36.2%,曲线下面积(area under the curve,AUC)指标提升12.9%,正确关节点百分比(percentage of correct keypoints,PCK)指标提升3%。这表明,面对复杂动作时,网络利用各分支提取不同的关节时序运动特征,将不同关节组的位置信息进行融合交互,并结合当前帧的关节姿态信息加以约束,从而获得更高的精度。在HumanEva数据集上的实验结果验证了本方法适用于不同数据集,消融实验进一步验证了各个模块的有效性。
结论
本文提出的网络有效融合了人体多关节特征,能够显著提高单目视觉三维人体姿态估计的准确性,且具备较强的泛化能力。
Objective
Monocular cameras are widely used to capture photos and videos, but human occlusion and the loss of depth information still make it difficult to estimate a stable and accurate 3D human pose from their output. Conventional approaches attempted to estimate the 3D human pose directly from images, relying on manually designed features to reconstruct the pose; lacking depth information and suffering from occlusion, they achieved only low accuracy. Thanks to the improved accuracy of 2D joint detectors, the task can now be divided into two stages. First, a 2D joint detector is applied to video or RGB images to estimate the locations of the 2D joints in image space. Second, the 2D joints are lifted into 3D space by leveraging their positions. This work focuses on the second stage: reconstructing the 3D pose of the human body from 2D joint data. Following the success of deep learning in other fields, recurrent neural networks (RNNs) have been applied to human pose estimation. An RNN can extract human postural features that vary in both space and time and is more efficient than earlier approaches; however, its high computational cost makes it impractical for long sequences. Temporal convolutional networks (TCNs) have therefore been employed instead, offering fewer parameters and lower error, but they struggle to combine spatial and temporal features effectively and do not fully exploit the information carried by the 2D joints. Consequently, this paper presents a 3D human pose estimation network that integrates multi-joint features.
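To make the two-stage formulation concrete, the following minimal Python sketch shows only the interface between the stages; the sequence length, joint count, and the placeholder `lift_to_3d` function are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the two-stage pipeline, with hypothetical shapes:
# stage 1 (a pretrained 2D detector) yields per-frame joint coordinates,
# and stage 2 lifts the 2D sequence to the 3D pose of the centre frame.
import torch

T, J = 243, 17                      # sequence length, joints (Human3.6M layout)
pose_2d = torch.randn(1, T, J, 2)   # stage 1 output: (batch, frames, joints, xy)

def lift_to_3d(seq_2d: torch.Tensor) -> torch.Tensor:
    """Stage 2 (placeholder): regress the 3D pose of the centre frame."""
    centre = seq_2d[:, T // 2]                 # (batch, joints, 2)
    depth = torch.zeros_like(centre[..., :1])  # stand-in for the learned depth
    return torch.cat([centre, depth], dim=-1)  # (batch, joints, 3)

pose_3d = lift_to_3d(pose_2d)
print(pose_3d.shape)  # torch.Size([1, 17, 3])
```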
Method
The joint motion feature extraction module comprises four parallel branches, each extracting distinct joint motion features in the temporal dimension. Fusing these diverse features yields a highly expressive representation that enhances the network's ability to capture subtle motion changes. The joint feature fusion module independently extracts temporally significant features for each joint group and uses matrix inner products to describe the relative spatial relationships between joint groups in a high-dimensional space; this establishes overall inter-group connections, ensures the accuracy and coherence of the estimated poses, and produces an initial estimate of the intermediate-frame 3D pose. The joint constraint module takes the spatial relationships of the 2D joints in the intermediate frame as global information and applies them as an implicit 2D-3D constraint on the initial 3D estimate, suppressing unrealistic poses and improving the accuracy of the final predicted 3D pose.
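As a hedged illustration of these two ideas, the PyTorch sketch below pairs a four-branch temporal convolution with an inner-product affinity between joint-group embeddings. The class name `MultiBranchTemporal`, the kernel sizes, and all shapes are assumptions chosen for exposition, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiBranchTemporal(nn.Module):
    """Four parallel temporal branches fused by channel concatenation
    (a sketch; the paper's branch designs may differ)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch // 4, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7)    # each branch sees a different temporal extent
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); same-length outputs are concatenated
        return torch.cat([branch(x) for branch in self.branches], dim=1)

x = torch.randn(1, 34, 243)              # 17 joints x 2 coords over 243 frames
feats = MultiBranchTemporal(34, 256)(x)  # (1, 256, 243) fused motion features

# Inner-product affinity between joint-group embeddings: each group attends
# to every other group, exchanging global relative-position information.
groups = torch.randn(1, 5, 64)                                      # (batch, groups, dim)
affinity = torch.softmax(groups @ groups.transpose(1, 2), dim=-1)   # (1, 5, 5)
fused = affinity @ groups                                           # (1, 5, 64)
```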
Result
In this paper, we assess accuracy on three public datasets to compare the proposed method with state-of-the-art models, evaluating each model's ability to reconstruct 3D poses. To verify the method's suitability, we conducted experiments on all three datasets using the same network structure. On the Human3.6M dataset, the proposed method yields an MPJPE of 29.0 mm, a 4.9% reduction in error compared with the MHFormer method, demonstrating its effectiveness. For complex actions such as SittingDown and WalkDog, the error is reduced by 7.7% and 8.2%, respectively. On the MPI-INF-3DHP dataset, the proposed method shows substantial improvements over MHFormer: a 36.2% reduction in mean per joint position error (MPJPE), a 12.9% increase in the area under the curve (AUC), and a 3% increase in the percentage of correct keypoints (PCK). These results demonstrate that, for complex actions, the network uses its branches to extract diverse temporal motion features of the joints, infers joint positions from contextual information, and enhances spatiotemporal interaction between joint groups using the positional information of other joints. By integrating the joint information of the current frame as a constraint, the network keeps the estimated poses realistic and achieves superior spatiotemporal modeling. When benchmarked against state-of-the-art methods on the HumanEva dataset, our method remains strongly competitive, and across every dataset tested it consistently yields favorable outcomes, substantiating its generalizability to diverse human pose datasets. Moreover, thorough ablation experiments validate the effectiveness of each module, showing that every component contributes to the overall accuracy of the network. An ablation on the input sequence length further reveals that longer input sequences yield better performance and increasingly accurate 3D pose estimates.
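For reference, the evaluation metrics quoted above have standard definitions; the NumPy sketch below computes them on synthetic joints. The 150 mm PCK threshold follows common MPI-INF-3DHP practice, and AUC is the PCK averaged over thresholds from 0 to 150 mm; the random data is illustrative only.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per joint position error: average Euclidean distance (mm)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pck(pred: np.ndarray, gt: np.ndarray, threshold: float = 150.0) -> float:
    """Fraction of joints whose error is below the threshold (mm)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1) < threshold))

def auc(pred: np.ndarray, gt: np.ndarray) -> float:
    """PCK averaged over thresholds 0..150 mm, as used on MPI-INF-3DHP."""
    return float(np.mean([pck(pred, gt, t) for t in np.linspace(0, 150, 31)]))

# Synthetic example: 17 ground-truth joints (mm) with simulated prediction noise.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 1000, size=(17, 3))
pred = gt + rng.normal(0, 20, size=(17, 3))
print(f"MPJPE {mpjpe(pred, gt):.1f} mm, PCK {pck(pred, gt):.2%}, AUC {auc(pred, gt):.2%}")
```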
Conclusion
In this study, the experimental results and visualizations indicate that the proposed human pose estimation model, incorporating three integral modules, builds temporal and spatial dependencies between joints more effectively than other methods. The model provides a comprehensive representation of the human body's joint space and is particularly effective against self-occlusion: when body parts are occluded, it exploits the interdependencies within the context to mitigate the impact of occlusion on pose estimation, thereby improving prediction accuracy. Even when depth information is lacking, the approach can still estimate human poses accurately by leveraging the spatial positional relationships among the joints.
Belagiannis V, Amin S, Andriluka M, Schiele B, Navab N and Ilic S. 2014. 3D pictorial structures for multiple human pose estimation // Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 1669-1676 [DOI: 10.1109/CVPR.2014.216]
Cai J, Liu H, Ding R, Li W, Wu J and Ban M. 2023. HTNet: human topology aware network for 3D human pose estimation // Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island, Greece: IEEE: 1-5 [DOI: 10.1109/ICASSP49357.2023.10095949]
Chen T, Fang C, Shen X, Zhu Y, Chen Z and Luo J. 2022. Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology, 32(1): 198-209 [DOI: 10.1109/TCSVT.2021.3057267]
Chen X N, Liang C, Huang D, Real E, Wang K, Liu Y, Pham H, Dong X, Luong T, Hsieh C, Lu Y and Le Q V. 2023. Symbolic discovery of optimization algorithms [EB/OL]. [2023-02-13]. https://arxiv.org/pdf/2302.06675.pdf
Chen Z M, Dai J and Pan J J. 2024. A conditional diffusion model for 3D human pose estimation // Proceedings of International Conference on Consumer Electronics and Computer Engineering. Guangzhou, China: IEEE: 20-24 [DOI: 10.1109/ICCECE61317.2024.10504230]
Diaz-Arias A and Shin D. 2024. ConvFormer: parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention. The Visual Computer, 40(4): 2555-2569 [DOI: 10.1007/s00371-023-02936-5]
Einfalt M, Ludwig K and Lienhart R. 2023. Uplift and upsample: efficient 3D human pose estimation with uplifting transformers // Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 2902-2912 [DOI: 10.1109/WACV56688.2023.00292]
Hassanin M, Khamiss A, Bennamoun M, Boussaid F and Radwan I. 2022. CrossFormer: cross spatio-temporal transformer for 3D human pose estimation [EB/OL]. [2022-03-24]. https://arxiv.org/pdf/2203.13387.pdf
Hossain M R I and Little J J. 2017. Exploiting temporal information for 3D pose estimation [EB/OL]. [2017-11-23]. https://arxiv.org/pdf/1711.08585v4.pdf
Hu W B, Zhang C G, Zhan F N, Zhang L and Wong T T. 2021. Conditional directed graph convolution for 3D human pose estimation // Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM: 602-611 [DOI: 10.1145/3474085.3475219]
Ionescu C, Papava D, Olaru V and Sminchisescu C. 2014. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1325-1339 [DOI: 10.1109/TPAMI.2013.248]
Kang H, Wang Y, Liu M, Wu D D, Liu P and Yang W M. 2023. Double-chain constraints for 3D human pose estimation in images and videos [EB/OL]. [2023-08-10]. https://arxiv.org/pdf/2308.05298.pdf
Li H, Shi B, Dai W, Zheng H, Wang B, Sun Y, Guo M, Li C L, Zou J and Xiong H K. 2023. Pose-oriented transformer with uncertainty guided refinement for 2D-to-3D human pose estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1): 1296-1304 [DOI: 10.1609/aaai.v37i1.25213]
Li W, Liu H, Tang H, Wang P and Van Gool L. 2022. MHFormer: multi-hypothesis transformer for 3D human pose estimation // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13137-13146 [DOI: 10.1109/CVPR52688.2022.01280]
Li W H, Liu M Y, Liu H, Guo T Y, Wang T, Tang H and Sebe N. 2022. GraphMLP: a graph MLP-like architecture for 3D human pose estimation [EB/OL]. [2022-06-13]. https://arxiv.org/pdf/2206.06420.pdf
Li W, Liu H, Ding R, Liu M, Wang P and Yang W. 2023. Exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Transactions on Multimedia, 25: 1282-1293 [DOI: 10.1109/TMM.2022.3141231]
Li S J, Zhu H S, Wang L and Liu X P. 2022. Dual auto-encoder network for human skeleton motion data optimization. Journal of Image and Graphics, 27(4): 1277-1289
李书杰, 朱海生, 王磊, 刘晓平. 2022. 面向人体骨骼运动数据优化的双自编码器网络. 中国图象图形学报, 27(4): 1277-1289 [DOI: 10.11834/jig.200780]
Lin J H and Lee G H. 2019. Trajectory space factorization for deep video-based 3D human pose estimation [EB/OL]. [2019-08-22]. https://arxiv.org/pdf/1908.08289.pdf
Liu R, Shen J, Wang H, Chen C, Cheung S and Asari V. 2020. Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 5063-5072 [DOI: 10.1109/CVPR42600.2020.00511]
Liu J, Rojas J, Liang Z J, Li Y H and Guan Y S. 2020. A graph attention spatio-temporal convolutional network for 3D human pose estimation in video [EB/OL]. [2020-03-11]. https://arxiv.org/pdf/2003.14179.pdf
Lin K, Wang L and Liu Z. 2021. End-to-end human pose and mesh reconstruction with transformers // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1954-1963 [DOI: 10.1109/CVPR46437.2021.00199]
Mehta D, Rhodin H, Casas D, Fua P, Sotnychenko O and Xu W. 2017. Monocular 3D human pose estimation in the wild using improved CNN supervision // Proceedings of 2017 International Conference on 3D Vision. Qingdao, China: IEEE: 506-516 [DOI: 10.1109/3DV.2017.00064]
Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel H, Xu W P, Casas D and Theobalt C. 2017. VNect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36(4): 14 [DOI: 10.1145/3072959.3073596]
Pavlakos G, Zhou X W, Derpanis K G and Daniilidis K. 2017. Coarse-to-fine volumetric prediction for single-image 3D human pose // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1263-1272 [DOI: 10.1109/CVPR.2017.139]
Pavllo D, Feichtenhofer C, Grangier D and Auli M. 2019. 3D human pose estimation in video with temporal convolutions and semi-supervised training // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7745-7754 [DOI: 10.1109/CVPR.2019.00794]
Shan W, Liu Z, Zhang X, Wang S, Ma S and Gao W. 2022. P-STMO: pre-trained spatial temporal many-to-one model for 3D human pose estimation // Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 461-478 [DOI: 10.1007/978-3-031-20065-6_27]
Sigal L, Balan A O and Black M J. 2010. HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1-2): 4-27 [DOI: 10.1007/s11263-009-0273-6]
Sun X, Shang J X, Liang S and Wei Y C. 2017. Compositional human pose regression // Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2621-2630 [DOI: 10.1109/ICCV.2017.284]
Tang Z H, Qiu Z F, Hao Y B, Hong R C and Yao T. 2023. 3D human pose estimation with spatio-temporal criss-cross attention // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 4790-4799 [DOI: 10.1109/CVPR52729.2023.00464]
Tekin B, Márquez-Neila P, Salzmann M and Fua P. 2017. Learning to fuse 2D and 3D image cues for monocular body pose estimation // Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3961-3970 [DOI: 10.1109/ICCV.2017.425]
Wang J B, Yan S J, Xiong Y J and Lin D H. 2020. Motion guided 3D pose estimation from videos // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 764-780 [DOI: 10.1007/978-3-030-58601-0_45]
Xue Y Z, Chen J S, Gu X M, Ma H M and Ma H B. 2022. Boosting monocular 3D human pose estimation with part aware attention. IEEE Transactions on Image Processing, 31: 4278-4291 [DOI: 10.1109/TIP.2022.3182269]
Xue F, Bian F L and Li S J. 2024. Attention fusion network for estimation of 3D joint coordinates and rotation of human pose. Journal of Image and Graphics, 29(10): 3116-3129
薛峰, 边福利, 李书杰. 2024. 面向三维人体坐标及旋转角估计的注意力融合网络. 中国图象图形学报, 29(10): 3116-3129 [DOI: 10.11834/jig.230502]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition // Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI: 7444-7452 [DOI: 10.5555/3504035.3504947]
Zeng A L, Sun X, Huang F Y, Liu M H, Xu Q and Lin S. 2020. SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach [EB/OL]. [2020-07-18]. https://arxiv.org/pdf/2007.09389.pdf
Zhan Y, Li F H, Weng R L and Choi W G. 2022. Ray3D: ray-based 3D human pose estimation for monocular absolute 3D localization // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13106-13115 [DOI: 10.1109/CVPR52688.2022.01277]
Zhang J L, Tu Z G, Yang J Y, Chen Y J and Yuan J S. 2022. MixSTE: seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13222-13232 [DOI: 10.1109/CVPR52688.2022.01288]
Zhang K X, Luan X M, Shah Syed T H and Xiang X Z. 2023. An improving cos-reweighting transformer for 3D human pose estimation in video // Proceedings of the Chinese Control and Decision Conference. Yichang, China: IEEE: 436-441 [DOI: 10.1109/CCDC58219.2023.10326602]
Zheng C, Zhu S J, Mendieta M, Yang T, Chen C and Ding Z M. 2021. 3D human pose estimation with spatial and temporal transformers // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11636-11645 [DOI: 10.1109/ICCV48922.2021.01145]