Hand-object pose estimation method fusing feature enhancement and complementation
2024, Pages: 1-16
Online publication date: 2024-12-23
DOI: 10.11834/jig.240272
GU Siyuan, GAO Shu. Hand-object pose estimation method fusing feature enhancement and complementation[J]. Journal of Image and Graphics, 2024: 1-16.
Objective
Joint hand-object pose estimation from a single RGB image is highly challenging because severe occlusion frequently occurs when the hand interacts with an object. In addition, existing hand-object feature extraction networks typically use a feature pyramid network (FPN) to fuse multi-scale features, but FPN-based methods suffer from channel information loss. To address these problems, we propose the hand-object feature enhancement complementary (HOFEC) model.
Method
1) To counter channel information loss, we design a channel attention-guided feature pyramid network (CAG-FPN), which introduces a channel attention mechanism into the FPN so that the model better captures the relationships and relative importance among the channels of the input data while fusing multi-scale features; together with a dual-stream, locally weight-shared ResNet-50 (50-layer residual network), it forms the hand-object feature extraction network and improves the model's feature extraction ability (see the sketch following this section). 2) To handle the mutual occlusion that arises during hand-object interaction, we design a spatial attention module that enhances the hand and object features separately while extracting information about the occluded regions, and we further design a cross-attention module that makes the hand and object features complement each other, fully exploiting the occlusion cues of the hand and object regions to achieve feature enhancement and complementation. 3) A hand decoder and an object decoder recover the hand pose and the object pose, respectively.
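To make step 1) concrete, here is a minimal PyTorch sketch of one channel-attention-guided top-down fusion step. The module names (ChannelAttention, CAGFPNLayer), the SE-style squeeze-and-excitation gate, and the reduction ratio of 16 are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """SE-style channel gate: re-weights channels by global importance (assumed form)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # global average pool -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)   # channel-wise re-weighting

class CAGFPNLayer(nn.Module):
    """One top-down FPN fusion step, with the lateral feature gated by channel attention."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, 1)
        self.attn = ChannelAttention(out_channels)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, c_i, p_up):
        lat = self.attn(self.lateral(c_i))         # attend channels before fusing
        p = lat + F.interpolate(p_up, size=lat.shape[-2:], mode='nearest')
        return self.smooth(p)

# usage sketch: fuse the coarser pyramid map p5 into backbone stage c4 (ResNet-50: 1024 channels)
# p4 = CAGFPNLayer(in_channels=1024)(c4, p5)
```

The shuffle operation mentioned later in the paper, and any interaction with the locally shared dual-stream backbone, are omitted here for brevity.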
Result
Compared with state-of-the-art models on the HO3D and DexYCB datasets, the proposed method achieves competitive results on both hand pose estimation and object pose estimation. On HO3D, against ten recent models, the hand pose errors PAMPJPE and PAMPVPE are both 0.1 mm lower than those of the second-best HandOccNet, and the object pose accuracy ADD-0.1D is 2.1% higher than that of the second-best HFL-Net. On DexYCB, against seven recent models, the hand pose errors MPJPE and PAMPJPE are 0.2 mm and 0.1 mm lower, respectively, than those of the second-best HFL-Net, and ADD-0.1D is 6.4% higher than that of HFL-Net.
Conclusion
The proposed HOFEC model can simultaneously and accurately estimate hand and object poses in hand-object interaction scenarios (code: https://github.com/rookiiiie/HOFEC).
Objective
Estimating hand and object poses jointly from a single RGB image is a highly challenging task, primarily because of the severe occlusions that occur during hand-object interaction and obscure critical features. Hand-object interaction scenes are also highly dynamic and complex, so traditional computer vision techniques handle them poorly, especially under significant occlusion. Furthermore, existing hand-object feature extraction networks commonly use a feature pyramid network (FPN) to fuse multi-scale features and capture information across levels; however, FPN-based methods lose channel information during feature extraction, which directly harms the accuracy of the final pose estimates. To address these challenges, we propose a novel hand-object feature enhancement complementary (HOFEC) model that optimizes feature extraction and fusion, thereby improving pose estimation under complex backgrounds and occlusion.
Method
1) To address the channel information loss that is prevalent in feature extraction, we propose the channel attention-guided feature pyramid network (CAG-FPN). It integrates a channel attention mechanism into the traditional feature pyramid network (FPN), enhancing the model's ability to discern the relationships and relative importance among the channels of the input data during multi-scale feature fusion. The channel attention mechanism dynamically adjusts the weights of feature channels according to their task relevance, which lets the network mine and exploit the feature information that is pivotal for accurate recognition. We combine CAG-FPN with a dual-stream ResNet-50 designed around locally shared weights to jointly construct the hand-object feature extraction network, markedly improving the model's capacity to capture and represent hand and object features, particularly in complex scenes with high variability and occlusion. 2) To tackle the mutual occlusion that arises during hand-object interaction, we develop a spatial attention module that enhances the features of both the hand and the object while extracting information about the occluded regions. Spatial attention lets the model focus on the areas that matter for pose estimation, including regions hidden by occlusion. On top of this, we design a cross-attention module that exchanges secondary features between the two branches: the secondary features of the hand are injected into the primary features of the object, and vice versa, making the hand and object features complementary. The module integrates occlusion information from both regions and uses a correlation matrix to filter out irrelevant background noise, so that feature enhancement and mutual complementation are carried out precisely (a hedged sketch of these two modules follows this section). 3) Separate hand and object decoders recover the hand pose and the object pose independently while accounting for hand-object interaction during information fusion, so the final pose outputs are accurate and consistent. This design allows the model to perform pose estimation effectively in complex hand-object interaction scenarios and provides more reliable technical support for practical applications.
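To illustrate step 2), the following PyTorch sketch shows one plausible form of the two modules: a spatial-attention gate that splits each branch into primary (attended) and secondary (complementary, occlusion-related) features, and a cross-attention block that injects one branch's secondary features into the other's primary features through a softmax-normalized correlation matrix. All names and the exact gating/attention forms are assumptions for exposition, not the paper's verbatim design:

```python
import torch
import torch.nn as nn

class SpatialSplit(nn.Module):
    """Spatial-attention gate: the attended map is the primary feature; its
    complement (often covering occluded regions) is the secondary feature."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        a = self.gate(x)                 # (B, 1, H, W) spatial weights in [0, 1]
        return x * a, x * (1.0 - a)      # primary, secondary

class CrossComplement(nn.Module):
    """Cross-attention: enrich one branch's primary features with the other
    branch's secondary features; softmax over the correlation matrix
    suppresses weakly correlated (background) positions."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, primary, secondary):
        B, C, H, W = primary.shape
        q = self.q(primary).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(secondary).flatten(2)                    # (B, C, HW)
        v = self.v(secondary).flatten(2).transpose(1, 2)    # (B, HW, C)
        corr = torch.softmax((q @ k) * self.scale, dim=-1)  # correlation matrix
        out = (corr @ v).transpose(1, 2).reshape(B, C, H, W)
        return primary + out                                # residual injection

# usage sketch (256-channel features):
# hand_p, hand_s = SpatialSplit(256)(hand_feat); obj_p, obj_s = SpatialSplit(256)(obj_feat)
# obj_enhanced = CrossComplement(256)(obj_p, hand_s)    # hand secondary -> object primary
# hand_enhanced = CrossComplement(256)(hand_p, obj_s)   # object secondary -> hand primary
```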
Result
Compared with state-of-the-art (SOTA) models, the proposed method delivers competitive performance on both hand pose estimation and object pose estimation on the HO3D and DexYCB datasets. On HO3D, against ten recent models, the hand pose errors PAMPJPE and PAMPVPE are both 0.1 mm lower than those of the second-best model, HandOccNet, and the object pose metric ADD-0.1D surpasses the second-best HFL-Net by 2.1%. On DexYCB, against seven recent models, MPJPE and PAMPJPE are 0.2 mm and 0.1 mm lower than those of the second-best HFL-Net, respectively, and ADD-0.1D exceeds HFL-Net by 6.4%. (A sketch of how these metrics are computed follows.)
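For reference, the metrics quoted above can be computed as in the NumPy sketch below. MPJPE/PAMPJPE (and the vertex analogue PAMPVPE) are errors in millimetres where lower is better; ADD-0.1D is an accuracy where higher is better. Function names and the per-frame formulation are ours for illustration:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (mm) between (J, 3) joint arrays."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: fit a similarity transform (scale,
    rotation, translation) from pred to gt, then measure the error.
    Applied to mesh vertices instead of joints, the same computation
    gives PAMPVPE."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    s = (S * np.array([1.0, 1.0, d])).sum() / (p ** 2).sum()
    return mpjpe(s * p @ R + mu_g, gt)

def add_01d_hit(pred_pts, gt_pts, diameter):
    """Per-frame ADD-0.1D hit: the mean distance between corresponding object
    model points under the predicted and ground-truth poses is below 10% of
    the object diameter. The reported score is the hit rate over all frames."""
    return np.linalg.norm(pred_pts - gt_pts, axis=1).mean() < 0.1 * diameter
```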
Conclusion
The HOFEC model proposed in this paper improves the accuracy of hand-object pose estimation in interactive scenarios by enabling complementary information exchange between the hand and the object. By introducing a channel attention mechanism together with shuffle operations, the model not only mitigates the channel information loss of feature pyramid networks (FPN) but also supplements and strengthens features at different scales. We design a spatial-attention-based feature enhancement module that enhances hand and object features at the spatial scale while extracting the secondary features of both. Through a cross-attention mechanism, these secondary features complement the primary features of the hand and the object, and the irrelevant background information carried by the secondary features is filtered out. This design addresses the underuse of occlusion information and improves the accuracy of hand-object pose estimation. On this basis, we build a hand-object decoder that decodes the hand and the object separately and reconstructs their complete poses. Experimental results show that even under severe occlusion during hand-object interaction, the proposed HOFEC model can still estimate hand and object poses accurately.
Keywords: hand-object pose estimation; feature extraction network; feature enhancement; feature complementation; attention mechanism
Chao Y W, Yang W, Xiang Y, Molchanov P, Handa A, Tremblay J, Narang Y S, Van Wyk K, Iqbal U, Birchfield S, Kautz J and Fox D. 2021. DexYCB: a benchmark for capturing hand grasping of objects//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 9044-9053 [DOI: 10.1109/CVPR46437.2021.00893]
Chen H, Tian W, Wang P, Wang F, Xiong L and Li H. 2022a. EPro-PnP: generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 2771-2780 [DOI: 10.1109/CVPR52688.2022.00280]
Chen X, Liu Y, Dong Y, Zhang X, Ma C, Xiong Y, Zhang Y and Guo X. 2022b. MobRecon: mobile-friendly hand mesh reconstruction from monocular image//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 20544-20554 [DOI: 10.1109/CVPR52688.2022.01989]
Chen Y, Tu Z, Kang D, Chen R, Bao L, Zhang Z and Yuan J. 2021. Joint hand-object 3D reconstruction from a single image with cross-branch feature fusion. IEEE Transactions on Image Processing, 30: 4008-4021 [DOI: 10.1109/TIP.2021.3068645]
Chen Z, Hasson Y, Schmid C and Laptev I. 2022c. AlignSDF: pose-aligned signed distance fields for hand-object reconstruction//Proceedings of the 2022 European Conference on Computer Vision (ECCV). Tel Aviv, Israel: Springer: 231-248 [DOI: 10.1007/978-3-031-19769-7_14]
Choi H, Moon G and Lee K M. 2020. Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose//Proceedings of the 2020 European Conference on Computer Vision (ECCV). Glasgow, UK: Springer: 769-787 [DOI: 10.1007/978-3-030-58571-6_45]
Hampali S, Sarkar S D, Rad M and Lepetit V. 2022. Keypoint Transformer: solving joint identification in challenging hands and object interactions for accurate 3D pose estimation//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11090-11100 [DOI: 10.1109/CVPR52688.2022.01081]
Hampali S, Rad M, Oberweger M and Lepetit V. 2020. HOnnotate: a method for 3D annotation of hand and object poses//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 3196-3206 [DOI: 10.1109/CVPR42600.2020.00326]
Handa A, Allshire A, Makoviychuk V, Petrenko A, Singh R, Liu J, Makoviichuk D, Van Wyk K, Zhurkevich A, Sundaralingam B and Narang Y. 2023. DeXtreme: transfer of agile in-hand manipulation from simulation to reality//Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). London, UK: IEEE: 5977-5984 [DOI: 10.1109/ICRA48891.2023.10160216]
Hasson Y, Tekin B, Bogo F, Laptev I, Pollefeys M and Schmid C. 2020. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 571-580 [DOI: 10.1109/CVPR42600.2020.00065]
He K, Zhang X, Ren S and Sun J. 2016. Deep residual learning for image recognition//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
He K, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2961-2969 [DOI: 10.1109/ICCV.2017.322]
He Y, Huang H, Fan H, Chen Q and Sun J. 2021. FFB6D: a full flow bidirectional fusion network for 6D pose estimation//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 3003-3013 [DOI: 10.1109/CVPR46437.2021.00302]
Jia D, Li Y Y, An T and Zhao J Y. 2023. Complex gesture pose estimation network fusing multiscale features. Journal of Image and Graphics, 28(09): 2887-2898 [DOI: 10.11834/jig.220636]
Li J, Pan X, Huang G, Zhang Z, Wang N, Bao H and Zhang G. 2024. RD-VIO: robust visual-inertial odometry for mobile augmented reality in dynamic environments. IEEE Transactions on Visualization and Computer Graphics [DOI: 10.1109/TVCG.2024.3353263]
Li D D, Zheng H R, Liu F C and Pan X. 2022a. 6D pose estimation based on mask location and hourglass network. Journal of Image and Graphics, 27(02): 642-652 [DOI: 10.11834/jig.200525]
Li C, Liu W, Guo R, Yin X, Jiang K, Du Y, Zhu L, Lai B, Hu X, Yu D and Ma Y. 2022b. PP-OCRv3: more attempts for the improvement of ultra lightweight OCR system [EB/OL]. https://arxiv.org/abs/2206.03001
Lin K, Wang L and Liu Z. 2021. End-to-end human pose and mesh reconstruction with transformers//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 1954-1963 [DOI: 10.1109/CVPR46437.2021.00199]
Lin Z, Ding C, Yao H, Kuang Z and Huang S. 2023. Harmonious feature learning for interactive hand-object pose estimation//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 12989-12998
Lin T Y, Dollár P, Girshick R, He K, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2117-2125 [DOI: 10.1109/CVPR.2017.106]
Liu S, Jiang H, Xu J, Liu S and Wang X. 2021. Semi-supervised 3D hand-object poses estimation with interactions in time//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 14687-14697 [DOI: 10.1109/CVPR46437.2021.01445]
Majewska O, Razumovskaia E, Ponti E M, Vulić I and Korhonen A. 2023. Cross-lingual dialogue dataset creation via outline-based generation. Transactions of the Association for Computational Linguistics, 11: 139-156 [DOI: 10.1162/tacl_a_00539]
Newell A, Yang K and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 2016 European Conference on Computer Vision (ECCV). Amsterdam, The Netherlands: Springer: 483-499 [DOI: 10.1007/978-3-319-46484-8_29]
Park J K, Oh Y, Moon G, Choi H and Lee K M. 2022. HandOccNet: occlusion-robust 3D hand mesh estimation network//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 2761-2770 [DOI: 10.1109/CVPR52688.2022.00155]
Romero J, Tzionas D and Black M J. 2017. Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6): #245 [DOI: 10.1145/3130800.3130883]
Spurr A, Iqbal U, Molchanov P, Hilliges O and Kautz J. 2020. Weakly supervised 3D hand pose estimation via biomechanical constraints//Proceedings of the 2020 European Conference on Computer Vision (ECCV). Glasgow, UK: Springer: 211-228 [DOI: 10.1007/978-3-030-58520-4_13]
Su Y, Saleh M, Fetzer T, Rambach J, Navab N, Busam B, Stricker D and Tombari F. 2022. ZebraPose: coarse to fine surface encoding for 6DoF object pose estimation//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 6738-6748 [DOI: 10.1109/CVPR52688.2022.00662]
Sun J, Wang Z, Zhang S, Zhao H, Zhang G and Zhou X. 2022. OnePose: one-shot object pose estimation without CAD models//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 6825-6834 [DOI: 10.1109/CVPR52688.2022.00670]
Wang R, Mao W and Li H. 2023. Interacting hand-object pose estimation via dense mutual attention//Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA: IEEE: 5735-5745 [DOI: 10.1109/WACV56688.2023.00569]
Xu H, Wang T, Tang X and Fu C. 2023. H2ONet: hand-occlusion-and-orientation-aware network for real-time 3D hand mesh reconstruction//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 17048-17058 [DOI: 10.1109/CVPR52729.2023.01635]
Yang L, Li K, Zhan X, Lv J, Xu W, Li J and Lu C. 2022. ArtiBoost: boosting articulated 3D hand-object pose estimation via online exploration and synthesis//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 2750-2760 [DOI: 10.1109/CVPR52688.2022.00277]