A comprehensive review of progress in deep-learning-based occluded human pose estimation
2024, Vol. 29, No. 12, Pages 3529-3542
Print publication date: 2024-12-16
DOI: 10.11834/jig.230730
Xu Linhao, Zhao Lin, Sun Xinxin, Yan Kedong, Li Guangyu. 2024. A comprehensive review of progress in deep-learning-based occluded human pose estimation. Journal of Image and Graphics, 29(12):3529-3542
Human pose estimation (HPE) is a fundamental task in computer vision that aims to recover the spatial coordinates of human body joints from a given image; it is widely used in action recognition, semantic segmentation, human-computer interaction, and person re-identification. With the rise of deep convolutional neural networks (DCNNs), HPE has made remarkable progress. Despite these achievements, however, it remains a challenging task, especially in the presence of complex poses, variations in keypoint scale, and occlusion. To summarize the development of occlusion-oriented HPE techniques, this paper systematically surveys representative methods proposed since 2018. According to the training data, model structure, and outputs of the networks involved, the methods are divided into three categories: preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Data-augmentation-based methods enlarge the training set by generating occluded samples; feature-discrimination-based methods suppress interfering features, for example through attention mechanisms; and methods based on human structure priors exploit prior knowledge of the body's structure to refine occluded poses. In addition, to better evaluate the performance of occlusion-handling methods, the MSCOCO (Microsoft common objects in context) val2017 dataset is re-annotated. Finally, the surveyed methods are compared and summarized to clarify their relative strengths and weaknesses under occlusion, and on this basis the reasons why HPE is difficult under occlusion and future trends in this field are discussed.
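The abstract notes that MSCOCO val2017 was re-annotated to better benchmark occlusion handling. COCO-style keypoint evaluation scores a prediction against the ground truth with object keypoint similarity (OKS); a minimal NumPy sketch of that metric, using the per-keypoint sigmas from the public COCO evaluation code, looks like this:

```python
import numpy as np

# Per-keypoint constants (sigmas) from the public COCO evaluation code
# (17 keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = np.array([
    .26, .25, .25, .35, .35, .79, .79, .72, .72,
    .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(pred, gt, vis, area, sigmas=COCO_SIGMAS):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt: (K, 2) arrays of (x, y) keypoint coordinates.
    vis:      (K,) visibility flags; only labeled keypoints (v > 0) count.
    area:     object segment area, used as the scale factor.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared pixel distances
    k2 = (2.0 * sigmas) ** 2                         # per-keypoint tolerance
    e = d2 / (2.0 * area * k2 + np.finfo(float).eps) # normalized error
    mask = vis > 0
    if not mask.any():
        return 0.0
    return float(np.exp(-e[mask]).mean())
```

A perfect prediction yields OKS = 1, and the score decays smoothly with distance, more forgivingly for large joints such as hips than for the nose.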
Human pose estimation (HPE) is a prominent area of research in computer vision whose primary goal is to accurately localize annotated keypoints of the human body, such as wrists and eyes. This fundamental task serves as the basis for numerous downstream applications, including human action recognition, human-computer interaction, pedestrian re-identification, video surveillance, and animation generation, among others. Thanks to the powerful nonlinear mapping capabilities offered by convolutional neural networks, HPE has experienced notable advancements in recent years. Despite this progress, HPE remains a challenging task, particularly when facing complex postures, variations in keypoint scales, occlusion, and other factors. Notably, the current heatmap-based methods suffer from severe performance degradation when encountering occlusion, which remains a critical challenge in HPE given that diverse human postures, complex backgrounds, and various occluding objects can all cause performance degradation. To comprehensively delve into the recent advancements in occlusion-aware HPE, this paper not only explores the intricacies of occlusion prediction difficulties but also delves into the reasons behind these challenges. The identified challenges encompass the absence of annotated occluded data. Annotating occluded data is inherently complex and demanding. Most of the prevalent datasets for HPE predominantly focus on visible keypoints, with only a few datasets addressing and annotating occlusion scenarios. This deficiency in annotated occluded data during model training significantly compromises the robustness of models in effectively handling situations that involve a partial or complete obstruction of body keypoints. Feature confusion presents a key challenge for top-down HPE methods, where the reliance on detected bounding boxes extracted from the image leads to the cropping of the target person’s region for keypoint prediction. 
However, in the presence of occlusion, these detection boxes may include individuals other than the target person, thereby interfering with the accurate prediction of keypoints. This issue is particularly problematic because the high feature similarity between the target person and the interfering individuals prevents the model from distinguishing features effectively, thereby compromising the accuracy of keypoint predictions and emphasizing the need to develop strategies for addressing feature confusion in occluded scenes. Navigating the intricacies of inference becomes particularly challenging in the presence of substantial occlusion. The expansive coverage of occlusion leads to the loss of valuable contextual and structural information that is essential for accurately predicting the occluded keypoints. Contextual cues and structural insights play pivotal roles in the inference process, and their absence impedes the model’s ability to draw precise conclusions. The significant loss of contextual information also hampers the model’s capacity to glean necessary details from adjacent keypoints, which is crucial for making informed predictions about occluded keypoints. This, in turn, results in the potential omission of keypoints or the emergence of anomalous pose estimations. In addition, this paper systematically reviews representative methods since 2018. Based on the training data, model structure, and output results contained in neural networks, this paper categorizes methods into three types, namely, preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Preprocessing techniques based on data augmentation, which generate occluded data, are employed to augment training samples, compensate for the lack of annotated occluded data, and alleviate the performance degradation of the model in the presence of occlusion.
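As a concrete illustration of this category (a generic recipe, not any specific paper's method), occluded training samples can be synthesized by pasting random patches over a subset of annotated keypoints, forcing the network to infer those joints from context:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_keypoint_occlusion(image, keypoints, num_patches=2, patch_size=24):
    """Synthetic occlusion augmentation: paste random-noise patches centered
    on randomly chosen keypoints so that the occluded joints must be inferred
    from surrounding context during training."""
    img = image.copy()
    h, w = img.shape[:2]
    chosen = rng.choice(len(keypoints),
                        size=min(num_patches, len(keypoints)),
                        replace=False)
    for i in chosen:
        x, y = keypoints[i].astype(int)
        half = patch_size // 2
        x0, x1 = max(0, x - half), min(w, x + half)
        y0, y1 = max(0, y - half), min(h, y + half)
        if x0 < x1 and y0 < y1:
            # Overwrite the region with noise; real pipelines often paste
            # segmented object crops or body parts instead.
            img[y0:y1, x0:x1] = rng.integers(
                0, 256, size=img[y0:y1, x0:x1].shape)
    return img
```

Stronger variants paste segmented occluder crops or other people's limbs instead of noise, which better matches the occlusion statistics observed in real scenes.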
These techniques utilize synthetic methods to introduce occlusive elements and simulate occlusion scenarios observed in real-world settings. Through these techniques, the model is exposed to a diverse set of samples featuring occlusion during the training process, thereby enhancing its robustness in complex environments. This data augmentation strategy aids the model in understanding and adapting to occluded conditions for keypoint prediction. By incorporating diverse occlusion patterns, the model can learn a broad range of scenarios, thus improving its generalization ability in practical applications. This method not only helps enhance the model’s performance in occluded scenes but also provides comprehensive training to boost its adaptability to complex situations. Feature-discrimination-based methods utilize attention mechanisms and similar techniques to reduce interference features. By strengthening features associated with the target person and suppressing those related to non-target individuals, these methods effectively mitigate the interference caused by feature confusion. These methods rely on mechanisms, such as attention, to selectively emphasize relevant features, thereby allowing the model to focus on distinguishing the keypoint features of the target person from those of interfering individuals. By enhancing the discriminative power of features belonging to the target individual, the model becomes adept at navigating scenarios where feature confusion is prevalent. Methods based on human body structure priors optimize occluded poses by leveraging prior knowledge of the human body structure. The use of human body structure priors is particularly effective in providing valuable information about the structural aspects of the human body. These priors serve as constraints that improve the robustness of the model during the inference process. 
By incorporating these priors, the model is further informed about the expected configuration of body parts, even in the presence of occlusion. This prior knowledge helps guide the model’s predictions and ensures that the estimated poses adhere closely to anatomically plausible configurations. A comparative analysis is also conducted to highlight the strengths and limitations of each method in handling occlusion. This paper also discusses the challenges inherent to occluded pose estimation and offers some directions for future research in this area.
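Human-body-prior postprocessing can be as simple as a structural sanity check on the decoded skeleton. The sketch below is illustrative only (joint indices follow COCO's arm ordering; the bone set and tolerance are assumptions): it flags bones whose lengths are anatomically implausible within one skeleton, e.g. when an occluded wrist snaps to a neighboring person.

```python
import numpy as np

# Illustrative skeleton: arm bones in COCO joint order
# (5/6 = shoulders, 7/8 = elbows, 9/10 = wrists).
BONES = [(5, 7), (7, 9), (6, 8), (8, 10)]

def flag_implausible_bones(pose, bones=BONES, tol=2.5):
    """Simple structural prior: within one skeleton, limb bones should have
    comparable lengths. Bones deviating from the median bone length by more
    than a factor of `tol` are flagged as likely erroneous predictions.

    pose: (K, 2) array of (x, y) joint coordinates.
    Returns a boolean array, one flag per bone.
    """
    lengths = np.array([np.linalg.norm(pose[a] - pose[b]) for a, b in bones])
    med = np.median(lengths)
    return (lengths > tol * med) | (lengths < med / tol)
```

Flagged joints can then be re-estimated, for instance by resampling from the heatmap or by mirroring the symmetric limb, which is the kind of refinement the prior-based methods surveyed here formalize.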
Keywords: human pose estimation (HPE); occlusion; data augmentation; human structure prior; insufficient occlusion-annotated data