Attention fusion network for estimation of 3D joint coordinates and rotation of human pose
2024, Vol. 29, No. 10, Pages 3116-3129
Print publication date: 2024-10-16
DOI: 10.11834/jig.230502
Xue Feng, Bian Fuli, Li Shujie. 2024. Attention fusion network for estimation of 3D joint coordinates and rotation of human pose. Journal of Image and Graphics, 29(10):3116-3129
Objective
Three-dimensional (3D) human pose estimation has long been a research hotspot in computer vision. Most current methods directly regress 3D joint coordinates from videos or 2D coordinate points and ignore the estimation of joint rotation angles. However, joint rotation angles are crucial for applications such as virtual reality and computer animation. To address this issue, we propose an attention fusion network that estimates 3D human joint coordinates and joint rotation angles simultaneously. Furthermore, many existing video- or motion-sequence-based pose estimation methods lack a dedicated network for handling the root joint separately. This limitation reduces overall coordinate accuracy, especially when the subject moves extensively within the scene, and leads to drift and jitter. To tackle this problem, we also introduce a root joint processing module, which ensures smoother and more stable motion of the root joint in the generated poses.
Method
Our proposed attention fusion network for estimating 3D human joint coordinates and rotation angles follows a stepwise approach. First, we use a well-established 2D pose estimation algorithm to extract a 2D motion sequence from the video or image sequence. We then employ a bone length network and a bone direction network to estimate the bone lengths and bone directions of the human body from the 2D sequence, and from these estimates we compute the initial 3D joint coordinates. Next, we feed the initial 3D coordinates into a joint rotation angle estimation network to obtain the joint rotation angles, and we apply a forward kinematics (FK) layer to compute the 3D joint coordinates corresponding to those angles. Because of accumulated network errors, the FK-derived coordinates are slightly less accurate than the initial 3D coordinates; however, they have a more stable skeleton structure. To combine the advantages of both coordinate sequences, an attention fusion module integrates the initial 3D coordinates and the FK-derived coordinates into the final 3D joint coordinates. This stepwise estimation scheme allows constraints to be imposed on the intermediate states of the estimation, and the attention fusion mechanism combines high accuracy with skeletal stability, improving the precision of the final result. In addition, we design a dedicated root joint processing module that outputs more accurate root joint coordinates, further improving the precision and smoothness of the estimated 3D pose.
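To make the three stages concrete, the following is a minimal PyTorch-style sketch. The 17-joint skeleton topology, layer sizes, and the exact form of the fusion weights are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Hypothetical 17-joint, Human3.6M-style topology; PARENTS[j] is the parent
# of joint j (the paper's exact skeleton definition may differ).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def compose_initial_pose(lengths, directions):
    """Step 1: accumulate length * unit direction down the kinematic tree.

    lengths:    (T, J) predicted bone lengths (root entry unused)
    directions: (T, J, 3) predicted unit bone directions
    returns:    (T, J, 3) root-relative initial 3D joint coordinates
    """
    T, J = lengths.shape
    pos = torch.zeros(T, J, 3)
    for j in range(1, J):
        pos[:, j] = pos[:, PARENTS[j]] + lengths[:, j, None] * directions[:, j]
    return pos

def fk_layer(rotations, offsets):
    """Step 2: forward kinematics; rotate fixed rest-pose bone offsets by the
    accumulated local joint rotations, so bone lengths stay consistent.

    rotations: (T, J, 3, 3) local joint rotation matrices
    offsets:   (J, 3) rest-pose bone offsets (these fix the bone lengths)
    returns:   (T, J, 3) root-relative FK joint coordinates
    """
    T, J = rotations.shape[:2]
    glob = [rotations[:, 0]]                   # global rotation per joint
    pos = torch.zeros(T, J, 3)
    for j in range(1, J):
        p = PARENTS[j]
        pos[:, j] = pos[:, p] + (glob[p] @ offsets[j].view(3, 1)).squeeze(-1)
        glob.append(glob[p] @ rotations[:, j])
    return pos

class AttentionFusion(nn.Module):
    """Step 3: fuse the more accurate initial coordinates with the skeletally
    more stable FK coordinates via learned per-joint soft weights."""

    def __init__(self, hidden=32):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Dropout(0.25), nn.Linear(hidden, 1))

    def forward(self, x_init, x_fk):           # both (T, J, 3)
        a = torch.sigmoid(self.score(torch.cat([x_init, x_fk], dim=-1)))
        return a * x_init + (1.0 - a) * x_fk   # convex per-joint blend
```

Because each stage has a physically meaningful output (bone lengths, bone directions, joint rotations), every intermediate state can be supervised separately, which is what makes the constraints on intermediate states in the stepwise design possible.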
Result
We select several representative methods and compare their performance on the Human3.6M dataset in terms of the mean per joint position error (MPJPE). Human3.6M is one of the largest publicly available datasets for human pose estimation. It covers seven different subjects, each performing 15 actions captured by four cameras, and every action is annotated with 2D and 3D poses as well as camera intrinsic and extrinsic parameters. The actions include walking, jumping, and fist-clenching, covering a wide range of daily human activities. Experimental results show that our method is highly competitive: it achieves an average MPJPE of 45.0 mm across all actions, ranking first on some actions and second on most of the others. The method that ranks first overall cannot estimate joint rotation angles alongside the 3D joint coordinates, which is precisely the strength of our proposed method. Our training setup is as follows: we minimize the loss with the Adam optimizer, with a batch size of 64, a motion sequence length of 80, a learning rate of 0.001, and 50 training epochs; to prevent overfitting, we add a dropout layer with a rate of 0.25 to each module.
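For reference, here is a minimal sketch of the evaluation metric and the reported training configuration. The metric follows the standard root-aligned MPJPE protocol, and the placeholder model is an assumption standing in for the full network:

```python
import torch

def mpjpe(pred, gt, root=0):
    """Mean per joint position error (MPJPE), in the units of the inputs
    (millimetres for Human3.6M): the mean Euclidean distance over all joints
    and frames after aligning both poses to the root joint.

    pred, gt: (N, J, 3) predicted and ground-truth joint coordinates.
    """
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    return torch.norm(pred - gt, dim=-1).mean()

# Training hyperparameters as reported above; `model` is a placeholder for
# the full attention fusion network described in the Method section.
model = torch.nn.Linear(17 * 2, 17 * 3)  # stand-in module, not the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
BATCH_SIZE, SEQ_LEN, EPOCHS, DROPOUT_RATE = 64, 80, 50, 0.25
```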
Conclusion
To address the rotation ambiguity of traditional human pose estimation methods that estimate only 3D joint coordinates, we propose an attention fusion network for estimating 3D human joint coordinates and rotation angles. The method decomposes the 3D coordinates into skeleton lengths, skeleton directions, and joint rotation angles. First, from the skeleton lengths and directions, we compute the initial 3D joint coordinate sequence. Then, we input the 3D and 2D coordinates into the joint rotation module to compute the joint rotation angles corresponding to the joint coordinates. Because of factors such as network errors, the precision of the 3D joint coordinates may decrease during this step, so we employ an attention fusion network to mitigate these adverse effects and obtain more accurate 3D coordinates. Comparative experiments demonstrate that our method not only achieves competitive joint coordinate estimation accuracy but also estimates the corresponding joint rotation angles from video simultaneously with the 3D joint coordinates.
human pose estimation; joint coordinates; joint rotation angle; attention fusion; stepwise estimation