联合多视图可控融合和关节相关性的三维人体姿态估计
Combining multi-view controlled fusion and joint correlation for 3D human pose estimation
2025, Vol. 30, No. 1, pp. 254-267
Print publication date: 2025-01-16
DOI: 10.11834/jig.230908
董婧, 张鸿儒, 方小勇, 周东生, 杨鑫, 张强, 魏小鹏. 联合多视图可控融合和关节相关性的三维人体姿态估计[J]. 中国图象图形学报, 2025,30(1):254-267.
DONG JING, ZHANG HONGRU, FANG XIAOYONG, ZHOU DONGSHENG, YANG XIN, ZHANG QIANG, WEI XIAOPENG. Combining multi-view controlled fusion and joint correlation for 3D human pose estimation[J]. Journal of Image and Graphics, 2025, 30(1): 254-267.
目的
多视图三维人体姿态估计能够从多方位的二维图像中估计出各个关节点的深度信息,克服单目三维人体姿态估计中因遮挡和深度模糊导致的不适定性问题,但如果系统性能被二维姿态估计结果的有效性所约束,则难以实现最终三维估计精度的进一步提升。为此,提出了一种联合多视图可控融合和关节相关性的三维人体姿态估计算法CFJCNet(controlled fusion and joint correlation network),包括多视图融合优化模块、二维姿态细化模块和结构化三角剖分模块3部分。
方法
首先,基于极线几何框架的多视图可控融合优化模块有选择地利用极线几何原理提高二维热图的估计质量,并减少噪声引入;然后,基于图卷积与注意力机制联合学习的二维姿态细化方法以单视图中关节点之间的联系性为约束,更好地学习人体的整体和局部信息,优化二维姿态估计;最后,引入结构化三角剖分以获取人体骨长先验知识,嵌入三维重建过程,改进三维人体姿态的估计性能。
结果
该算法在两个公共数据集Human3.6M、Total Capture和一个合成数据集Occlusion-Person上进行了评估实验,平均关节误差为17.1 mm、18.7 mm和10.2 mm,明显优于现有的多视图三维人体姿态估计算法。
结论
本文提出了一个能够构建多视图间人体关节一致性联系以及各自视图中人体骨架内在拓扑约束的多视图三维人体姿态估计算法,优化二维估计结果,修正错误姿态,有效地提高了三维人体姿态估计的精确度,取得了最佳的估计结果。
Objective
3D human pose estimation is fundamental to understanding human behavior and aims to estimate 3D joint positions from images or videos. It is widely used in downstream tasks such as human-computer interaction, virtual fitting, autonomous driving, and pose tracking. According to the number of cameras, 3D human pose estimation can be divided into monocular and multi-view settings. The ill-posed problem caused by occlusion and depth ambiguity makes it difficult for monocular methods to estimate the 3D human joint positions. Multi-view 3D human pose estimation, however, can recover the depth of each joint from multiple images and thus overcome this problem. Most recent methods use a triangulation module that estimates the 3D joint positions by lifting their 2D counterparts, measured in multiple images, into 3D space. This module is usually embedded in a two-stage procedure: first, the 2D joint coordinates of the human in each view are estimated separately by a 2D pose detector; then, the 3D pose is recovered from the multi-view 2D poses by triangulation. On this basis, some methods use epipolar geometry to fuse human joint features and establish correlations among multiple views, which improves the accuracy of the 3D estimate. However, when system performance is constrained by the quality of the 2D estimation results, further improving the final 3D accuracy is difficult. Therefore, to extract human contextual information for more effective 2D features, we construct a novel 3D pose estimation network that exploits the correlation of the same joint across multiple views and the correlation between neighboring joints within a single view.
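The second stage of the two-stage procedure above ends in triangulation. As an illustration only (not the paper's structural triangulation, which additionally enforces bone-length constraints), a minimal linear (DLT) triangulation of one joint from V calibrated views can be sketched as follows; the projection matrices and 2D detections are hypothetical inputs:

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Linear (DLT) triangulation of a single joint.

    points_2d:  (V, 2) pixel coordinates of the joint in V views
    proj_mats:  (V, 3, 4) camera projection matrices
    returns:    (3,) estimated 3D joint position
    """
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With exact, noise-free projections from two views, the recovered point matches the original 3D joint; with noisy detections, the SVD yields the least-squares solution.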
Method
In this paper, we propose a 3D human pose estimation method based on multi-view controllable fusion and joint correlation (CFJCNet), which includes three parts: a controllable multi-view fusion optimization module, a 2D pose refinement module, and a structural triangulation module. First, a set of RGB images captured from multiple views is fed into the 2D detector to obtain 2D heatmaps, and the adaptive weight of each heatmap is learned by a weight-learning network with appearance-information and geometric-information branches. On this basis, we construct a multi-view controlled fusion optimization module based on the epipolar geometry framework, which analyzes the estimation quality of the joints in each camera view to guide the fusion process. Specifically, it selectively applies the principles of epipolar geometry to fuse all views according to the learned weights, ensuring that low-quality estimates benefit from auxiliary views while avoiding the introduction of noise into high-quality heatmaps. Subsequently, a 2D pose refinement module composed of attention mechanisms and graph convolution is applied. The attention mechanism enables the model to capture global content through weight assignment, while the graph convolutional network (GCN) exploits local information by aggregating the features of neighboring nodes and encodes the topological structure of the human skeleton. The network combining attention and the GCN not only learns human information better but also models the interdependence between joint points in a single view to refine the 2D pose estimation results. Finally, structural triangulation is introduced, with structural constraints of the human body and human skeleton lengths, into the 2D-to-3D inference process to improve the accuracy of 3D pose estimation. This paper adopts the pre-trained 2D backbone called simple baseline as the 2D detector to extract 2D heatmaps. The threshold ε = 0.99 is used to determine joint estimation quality, and the number of layers N = 3 is used for 2D pose refinement.
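The controlled fusion step can be illustrated with a small sketch. Assuming the epipolar warping of auxiliary-view heatmaps into the reference view has already been performed upstream (the `warped_heatmaps` input below is a hypothetical stand-in for that result), the ε = 0.99 threshold gates whether a joint's heatmap is fused at all: joints already estimated with high confidence keep their original heatmap, avoiding the introduction of noise.

```python
import numpy as np

def controlled_fuse(ref_heatmaps, warped_heatmaps, weights, eps=0.99):
    """Selective multi-view heatmap fusion (illustrative sketch, not the exact
    CFJCNet implementation).

    ref_heatmaps:    (J, H, W) joint heatmaps of the reference view
    warped_heatmaps: (V, J, H, W) auxiliary-view heatmaps warped into the
                     reference view (epipolar warping assumed done upstream)
    weights:         (V,) learned fusion weights for the auxiliary views
    eps:             confidence threshold; joints at or above it are untouched
    """
    fused = ref_heatmaps.copy()
    # Peak value of each joint heatmap serves as its confidence score.
    conf = ref_heatmaps.reshape(ref_heatmaps.shape[0], -1).max(axis=1)
    for j in range(ref_heatmaps.shape[0]):
        if conf[j] >= eps:
            continue  # high-quality estimate: skip fusion, avoid noise
        # Weighted sum of the reference heatmap and the warped auxiliary ones.
        aux = np.tensordot(weights, warped_heatmaps[:, j], axes=1)
        fused[j] = (ref_heatmaps[j] + aux) / (1.0 + weights.sum())
    return fused
```

The per-joint gating is what makes the fusion "controllable": the auxiliary views contribute only where the reference view's own estimate is weak.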
Result
We compare the performance of CFJCNet with that of state-of-the-art models on two public datasets, namely, Human3.6M and Total Capture, and a synthetic dataset called Occlusion-Person. The mean per joint position error (MPJPE) is used as the evaluation metric, which measures the Euclidean distance between the estimated 3D joint positions and the ground truth. MPJPE reflects the quality of the estimated 3D human poses and provides an intuitive comparison of the performance of different methods. On the Human3.6M dataset, the proposed method achieves an additional error reduction of 2.4 mm compared with the baseline Adafuse. Moreover, because our network introduces rich prior knowledge and effectively constructs the connectivity of human joints, CFJCNet achieves at least a 10% improvement over most methods that do not use the skinned multi-person linear (SMPL) model. Compared with learnable human mesh triangulation (LMT), which incorporates the SMPL model and volumetric triangulation, our method still achieves a 0.5 mm error reduction. On the Total Capture dataset, compared with the strong baseline Adafuse, our method exhibits a performance improvement of 2.6%. On the Occlusion-Person dataset, CFJCNet achieves optimal estimation for the vast majority of joints, improving performance by 19%. Furthermore, we compare the visualization results of 3D human pose estimation between our method and the baseline Adafuse on the Human3.6M and Total Capture datasets to provide a more intuitive demonstration of the estimation performance. The qualitative experimental results on both datasets demonstrate that CFJCNet can use the prior constraints of skeleton length to correct unreasonable erroneous poses.
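MPJPE, the metric used above, is simply the mean Euclidean distance between predicted and ground-truth joint positions. A minimal implementation (the input shape convention below is an illustrative assumption):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error, in the units of the inputs (here mm).

    pred, gt: (..., J, 3) predicted and ground-truth 3D joint positions
    """
    # Euclidean distance per joint, then averaged over all joints and samples.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```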
Conclusion
We propose a multi-view 3D human pose estimation method, CFJCNet, which constructs human joint consistency across multiple views as well as intrinsic topological constraints on the human skeleton within each view. The method achieves excellent 3D human pose estimation performance. Experimental results on the public datasets show that CFJCNet has significant advantages in evaluation metrics over other advanced methods, demonstrating its superiority and generalization ability.
多视图; 三维人体姿态估计; 关节相关性; 图卷积网络(GCN); 注意力机制; 三角剖分
multi-view; 3D human pose estimation; joint point correlation; graph convolutional network (GCN); attention mechanism; triangulation
Bao Y M, Zhao X and Qian D H. 2023. FusePose: IMU-vision sensor fusion in kinematic space for parametric human pose estimation. IEEE Transactions on Multimedia, 25: 7736-7746 [DOI: 10.1109/TMM.2022.3227472]
Chen H Y, He J Y, Xiang W M, Cheng Z Q, Liu W, Liu H B, Luo B, Geng Y F and Xie X S. 2023a. HDFormer: high-order directed transformer for 3D human pose estimation [EB/OL]. [2024-01-15]. https://arxiv.org/pdf/2302.01825.pdf
Chen Y X, Gu R S, Huang O H and Jia G Y. 2023b. VTP: volumetric transformer for multi-view multi-person 3D pose estimation. Applied Intelligence, 53(22): 26568-26579 [DOI: 10.1007/s10489-023-04805-z]
Chen Z, Zhao X and Wan X Y. 2022. Structural triangulation: a closed-form solution to constrained 3D human pose estimation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 695-711 [DOI: 10.1007/978-3-031-20065-6_40]
Chun S, Park S and Chang J Y. 2023. Learnable human mesh triangulation for 3D human pose and shape estimation//Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 2849-2858 [DOI: 10.1109/WACV56688.2023.00287]
Du X X, Vasudevan R and Johnson-Roberson M. 2019. Bio-LSTM: a biomechanically inspired recurrent neural network for 3-D pedestrian pose and gait prediction. IEEE Robotics and Automation Letters, 4(2): 1501-1508 [DOI: 10.1109/LRA.2019.2895266]
Hassan M T and Ben Hamza A. 2023. Regular splitting graph network for 3D human pose estimation. IEEE Transactions on Image Processing, 32: 4212-4222 [DOI: 10.1109/TIP.2023.3275914]
He Y H, Yan R, Fragkiadaki K and Yu S I. 2020. Epipolar transformers//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7776-7785 [DOI: 10.1109/CVPR42600.2020.00780]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Ionescu C, Papava D, Olaru V and Sminchisescu C. 2014. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1325-1339 [DOI: 10.1109/TPAMI.2013.248]
Iskakov K, Burkov E, Lempitsky V and Malkov Y. 2019. Learnable triangulation of human pose//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7717-7726 [DOI: 10.1109/ICCV.2019.00781]
Kadkhodamohammadi A and Padoy N. 2021. A generalizable approach for multi-view 3D human pose regression. Machine Vision and Applications, 32(1): #6 [DOI: 10.1007/s00138-020-01120-2]
Kim H W, Lee G H, Oh M S and Lee S W. 2023. Cross-view self-fusion for self-supervised 3D human pose estimation in the wild//Proceedings of the 16th Asian Conference on Computer Vision. Macau, China: Springer: 193-210 [DOI: 10.1007/978-3-031-26319-4_12]
Li W H, Liu H, Tang H, Wang P C and van Gool L. 2022. MHFormer: multi-hypothesis transformer for 3D human pose estimation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13137-13146 [DOI: 10.1109/CVPR52688.2022.01280]
Lu J X, Wan H, Li P Y, Zhao X B, Ma N and Gao Y. 2023. Exploring high-order spatio-temporal correlations from skeleton for person re-identification. IEEE Transactions on Image Processing, 32: 949-963 [DOI: 10.1109/TIP.2023.3236144]
Ma H Y, Chen L J, Kong D Y, Wang Z, Liu X W, Tang H, Yan X Y, Xie Y S, Lin S Y and Xie X H. 2021. TransFusion: cross-view fusion with transformer for 3D human pose estimation [EB/OL]. [2024-01-15]. https://arxiv.org/pdf/2110.09554.pdf
Ma H Y, Wang Z, Chen Y F, Kong D Y, Chen L J, Liu X W, Yan X Y, Tang H and Xie X H. 2022. PPT: token-pruned pose transformer for monocular and multi-view human pose estimation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 424-442 [DOI: 10.1007/978-3-031-20065-6_25]
Ma M, Li Y B, Wu X Q, Gao J F and Pan H P. 2020. Human pose tracking based on multi-feature fusion in videos. Journal of Image and Graphics, 25(7): 1459-1472 (in Chinese) [DOI: 10.11834/jig.190494]
Minar M R and Ahn H. 2021. CloTH-VTON: clothing three-dimensional reconstruction for hybrid image-based virtual try-on//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 154-172 [DOI: 10.1007/978-3-030-69544-6_10]
Pascual-Hernández D, de Frutos N O, Mora-Jiménez I and Cañas-Plaza J M. 2022. Efficient 3D human pose estimation from RGBD sensors. Displays, 74: #102225 [DOI: 10.1016/j.displa.2022.102225]
Qiu H B, Wang C Y, Wang J D, Wang N Y and Zeng W J. 2019. Cross view fusion for 3D human pose estimation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4341-4350 [DOI: 10.1109/ICCV.2019.00444]
Remelli E, Han S C, Honari S, Fua P and Wang R. 2020. Lightweight multi-view 3D pose estimation through camera-disentangled representation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6039-6048 [DOI: 10.1109/CVPR42600.2020.00608]
Shuai H, Wu L L and Liu Q S. 2023. Adaptive multi-view and temporal fusing transformer for 3D human pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4): 4122-4135 [DOI: 10.1109/TPAMI.2022.3188716]
Tang Z H, Li J, Hao Y B and Hong R C. 2023. MLP-JCG: multi-layer perceptron with joint-coordinate gating for efficient 3D human pose estimation. IEEE Transactions on Multimedia, 25: 8712-8724 [DOI: 10.1109/TMM.2023.3240455]
Trumble M, Gilbert A, Malleson C, Hilton A and Collomosse J. 2017. Total capture: 3D human pose estimation fusing video and inertial sensors//Proceedings of the 28th British Machine Vision Conference. London, UK: BMVA: 1-13 [DOI: 10.5244/c.31.14]
Tu H Y, Wang C Y and Zeng W J. 2020. VoxelPose: towards multi-camera 3D human pose estimation in wild environment//Proceedings of the 16th European Conference on Computer Vision (ECCV 2020). Glasgow, UK: Springer: 197-212 [DOI: 10.1007/978-3-030-58452-8_12]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wan X Y, Chen Z and Zhao X. 2023. View consistency aware holistic triangulation for 3D human pose estimation [EB/OL]. [2024-01-15]. https://arxiv.org/pdf/2302.11301.pdf
Wu W, Zhou D S, Zhang Q, Dong J and Wei X P. 2022. High-order local connection network for 3D human pose estimation based on GCN. Applied Intelligence, 52(13): 15690-15702 [DOI: 10.1007/s10489-022-03312-x]
Xiao B, Wu H P and Wei Y C. 2018. Simple baselines for human pose estimation and tracking//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 472-487 [DOI: 10.1007/978-3-030-01231-1_29]
Yan J J, Zheng W M, Xin M H and Qiu W. 2013. Bimodal emotion recognition based on body gesture and facial expression. Journal of Image and Graphics, 18(9): 1101-1106 (in Chinese) [DOI: 10.11834/jig.20130906]
Zhang Z, Wang C Y, Qin W H and Zeng W J. 2020. Fusing wearable IMUs with multi-view images for human pose estimation: a geometric approach//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 2197-2206 [DOI: 10.1109/CVPR42600.2020.00227]
Zhang Z, Wang C Y, Qiu W C, Qin W H and Zeng W J. 2021. Adafuse: adaptive multiview fusion for accurate human pose estimation in the wild. International Journal of Computer Vision, 129(3): 703-718 [DOI: 10.1007/s11263-020-01398-9]
Zou Z M and Tang W. 2021. Modulated graph convolutional network for 3D human pose estimation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 11457-11467 [DOI: 10.1109/ICCV48922.2021.01128]