头姿鲁棒的双一致性约束半监督表情识别
Semi-supervised facial expression recognition robust to head pose empowered by dual consistency constraints
2025年30卷第2期,页码: 435-450
纸质出版日期: 2025-02-16
DOI: 10.11834/jig.240205
王宇键, 何军, 张建勋, 孙仁浩, 刘学亮. 2025. 头姿鲁棒的双一致性约束半监督表情识别. 中国图象图形学报, 30(02):0435-0450
Wang Yujian, He Jun, Zhang Jianxun, Sun Renhao, Liu Xueliang. 2025. Semi-supervised facial expression recognition robust to head pose empowered by dual consistency constraints. Journal of Image and Graphics, 30(02):0435-0450
目的
现有表情识别方法聚焦提升模型的整体识别准确率,对方法的头部姿态鲁棒性研究不充分。在实际应用中,人的头部姿态往往变化多样,影响表情识别效果,因此研究头部姿态对表情识别的影响,并提升模型在该方面的鲁棒性显得尤为重要。为此,在深入分析头部姿态对表情识别影响的基础上,提出一种能够基于无标签非正脸表情数据提升模型头部姿态鲁棒性的半监督表情识别方法。
方法
首先按头部姿态对典型表情识别数据集AffectNet重新划分,构建了AffectNet-Yaw数据集,支持在不同角度上进行模型精度测试,提升了模型对比公平性。其次,提出一种基于双一致性约束的半监督表情识别方法(dual-consistency semi-supervised learning for facial expression recognition,DCSSL),利用空间一致性模块对翻转前后人脸图像的类别激活一致性进行空间约束,使模型训练时更关注面部表情关键区域特征;利用语义一致性模块通过非对称数据增强和自学式学习方法不断地筛选高质量非正脸数据用于模型优化。在无需对非正脸表情数据人工标注的情况下,方法直接从有标签正脸数据和无标签非正脸数据中学习。最后,联合优化了交叉熵损失、空间一致性约束损失和语义一致性约束损失函数,以确保有监督学习和半监督学习之间的平衡。
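上述按“表情类别 × 偏航角区间”均匀随机抽样构建均衡测试集的思路,可用如下示意代码表达(最小化示意,并非论文官方实现;样本字段、角度区间端点与每单元样本数均为假设):

```python
import random
from collections import defaultdict

def build_pose_balanced_split(samples, yaw_bins, per_cell, seed=0):
    """按“表情类别 × 偏航角区间”均匀随机抽样,构建姿态均衡的测试集。
    samples: [(image_path, label, yaw_degree), ...](字段为假设)
    yaw_bins: 偏航角区间端点,如 [0, 30, 60, 90]
    per_cell: 每个(类别, 角度区间)单元抽取的样本数
    """
    # 先把样本按(类别, 角度区间)分桶
    cells = defaultdict(list)
    for path, label, yaw in samples:
        for lo, hi in zip(yaw_bins[:-1], yaw_bins[1:]):
            if lo <= abs(yaw) < hi:
                cells[(label, lo)].append((path, label, yaw))
                break
    # 再从每个桶中等量随机抽样,保证类别轴与姿态轴同时均衡
    rng = random.Random(seed)
    split = []
    for cell in cells.values():
        split.extend(rng.sample(cell, min(per_cell, len(cell))))
    return split
```

固定随机种子可使划分过程确定可复现,便于不同方法在同一测试集上公平对比。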
结果
实验结果表明,头部姿态对自然场景表情识别有显著影响;提出的AffectNet-Yaw数据集具有更均衡的头部姿态分布,有效促进了对这种影响的全面评估;DCSSL方法结合空间一致性和语义一致性约束,充分利用无标签非正脸表情数据,显著提高了模型在头部姿态变化下的鲁棒性,较MA-NET(multi-scale and local attention network)和EfficientFace全监督方法,平均表情识别精度分别提升了5.40%和17.01%。
结论
本文提出的双一致性半监督方法能充分利用正脸和非正脸数据,显著提升了模型在头部姿态变化下的表情识别精度;新数据集有效支撑了对头部姿态对表情识别影响的全面评估。
Objective
The field of facial expression recognition (FER) has long been a vibrant area of research, with a focus on improving the accuracy of identifying expressions across a wide range of faces. However, despite these advancements, a crucial aspect that has not been adequately explored is the robustness of FER models to changes in head pose. In real-world applications, where faces are captured at various angles and poses, existing methods often struggle to accurately recognize expressions in faces with considerable pose variations. This limitation has created an urgent need to understand the extent to which head pose affects FER models and to develop robust models that can handle diverse poses effectively. First, the impact of head pose on FER was analyzed. Rigorous experimentation has provided strong evidence that existing FER approaches are indeed vulnerable when faced with faces exhibiting large head poses. This vulnerability not only limits the practical applicability of these methods but also highlights the critical need for research focused on enhancing the pose robustness of FER models. This challenge is addressed by introducing a semi-supervised framework that leverages unlabeled nonfrontal facial expression samples to increase the pose robustness of FER models. This framework aims to overcome the limitations of existing methods by exploring unlabeled data to supplement labeled frontal face data, allowing the model to learn representations that are invariant to head pose variations. Incorporating unlabeled data expands the model’s exposure to a wider range of poses, ultimately enhancing robustness and accuracy in FER. This study highlights the importance of pose robustness in FER and proposes a semi-supervised framework to address this critical limitation. Rigorous experimentation and analysis provide insights into the impact of head pose on FER, and a robust model to accurately recognize facial expressions across diverse poses is developed. 
This approach paves the way for more practical and reliable FER systems in real-world applications.
Method
The AffectNet dataset was reorganized via a deterministic resampling procedure to examine the impact of head pose on FER. In this procedure, equal numbers of faces from different expression categories and head pose intervals were uniformly and randomly sampled to build a new, challenging FER dataset called AffectNet-Yaw, whose test samples are balanced along both the category axis and the head pose axis. The AffectNet-Yaw dataset enables a deep investigation into how head pose affects the performance of an FER model. A semi-supervised FER framework based on dual consistency constraints (dual-consistency semi-supervised learning, DCSSL) was then developed to improve the robustness of the model to head poses. This framework leverages a spatial consistency module that forces the model to produce consistent category activation maps for each face and its flipped mirror during training, which allows the model to prioritize the key facial regions for FER. A semantic consistency module is also employed to force the model to extract consistent features from two augmentations of the same face that exhibit similar semantics. In particular, two different data augmentations were applied to each face: one weak and one strong. The weakly augmented face was flipped, and model predictions for this sample and its flipped mirror were obtained. Only those unlabeled nonfrontal faces for which the model makes the same prediction with high confidence are retained. Their predicted categories, together with their strongly augmented variants, comprise “data-label” pairs that are utilized for model training as pseudolabeled positives. This scheme increases data variation to benefit model optimization. Within the framework, a joint optimization target that integrates the cross-entropy loss, the spatial consistency constraint, and the semantic consistency constraint was devised to balance supervised and semi-supervised learning.
Owing to the joint training, the proposed framework requires no manual labeling of nonfrontal faces; instead, it directly learns from labeled frontal faces and unlabeled nonfrontal faces, substantially boosting its robustness and generalizability.
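The pseudo-label filtering rule, the flip-based spatial constraint, and the joint target described above can be sketched in NumPy as follows (a minimal illustration, not the authors' released implementation; the threshold `tau`, the min-aggregation of the two views' confidences, and the MSE form of the spatial term are assumptions):

```python
import numpy as np

def select_pseudo_labels(p_weak, p_flip, tau=0.95):
    """Semantic consistency filtering: keep unlabeled faces whose weakly
    augmented view and its horizontal flip agree on the predicted class
    with confidence >= tau. Returns (kept indices, pseudo labels) to be
    paired with the strongly augmented views as "data-label" pairs."""
    cls_w = p_weak.argmax(axis=1)
    cls_f = p_flip.argmax(axis=1)
    conf = np.minimum(p_weak.max(axis=1), p_flip.max(axis=1))
    keep = (cls_w == cls_f) & (conf >= tau)
    return np.nonzero(keep)[0], cls_w[keep]

def spatial_consistency_loss(cam, cam_of_flip):
    """Spatial consistency: flip the class activation map of the mirrored
    input back along the width axis (shape: classes x H x W) and penalize
    disagreement with the CAM of the original input via MSE."""
    return float(np.mean((cam - cam_of_flip[:, :, ::-1]) ** 2))

def dcssl_loss(l_ce, l_spatial, l_semantic, w_spatial=1.0, w_semantic=5.0):
    """Joint target: supervised cross-entropy plus the two weighted
    consistency terms; default weights follow the best setting reported
    in the Results."""
    return l_ce + w_spatial * l_spatial + w_semantic * l_semantic
```

In a training step, the indices returned by `select_pseudo_labels` would pick the corresponding strongly augmented samples from the unlabeled batch, and their pseudo labels would feed the semantic consistency term of `dcssl_loss`.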
Result
The evaluation results of various fully supervised FER methods on both the AffectNet and AffectNet-Yaw datasets underscore the profound impact of head pose variability in real-world scenarios, emphasizing the critical need to increase the robustness of the FER model to such challenges. Empirical findings confirm that the AffectNet-Yaw dataset serves as a rigorous and effective platform for comprehensive investigations into how head pose influences FER model performance. Comparative analyses between baseline models and state-of-the-art methods on the AffectNet and AffectNet-Yaw datasets reveal compelling insights. In particular, the novel DCSSL framework considerably enhances the model’s ability to adapt to head pose variations, showing substantial improvement over existing benchmarks. When MA-NET and EfficientFace are used as benchmarks, the DCSSL framework achieves considerable average performance improvements of 5.40% and 17.01%, respectively. In addition, the effectiveness of this approach was further verified by comparing it with models that have performed well in the FER field over the past three years on the two datasets. In terms of weighting parameter settings, different weighting choices have a considerable effect on model performance. A series of parameter selection experiments were conducted via the control variable approach; the model achieves optimal expression recognition performance on the AffectNet-Yaw test data when the weight of the spatial consistency constraint loss is set to 1 and that of the semantic consistency constraint loss is set to 5. These results highlight the efficacy of the proposed DCSSL framework in mitigating the detrimental effects of head pose variations on FER accuracy.
When spatial and semantic consistency modules are integrated, the proposed approach can not only improve model robustness but also demonstrate its ability to adapt and generalize effectively across diverse head poses encountered in real-world applications. This study not only contributes to a challenging new dataset named AffectNet-Yaw for advancing FER research under realistic conditions but also establishes a novel methodology named DCSSL that sets a new standard for addressing head pose challenges in FER. These advancements are pivotal for enhancing the reliability and applicability of FER systems in practical settings where head pose variability is prevalent.
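The weight-selection procedure described above can be sketched as a simple sweep over the two loss weights (a hedged illustration: `evaluate` stands for a full train-and-validate run returning validation accuracy and is hypothetical, as are the grids):

```python
def sweep_consistency_weights(evaluate, spatial_grid, semantic_grid):
    """Sweep candidate weights for the spatial and semantic consistency
    losses while all other settings are held fixed (control-variable
    style). evaluate(w_spatial, w_semantic) -> validation accuracy.
    Returns the best (w_spatial, w_semantic) pair."""
    best_score, best_pair = float("-inf"), None
    for w_spatial in spatial_grid:
        for w_semantic in semantic_grid:
            score = evaluate(w_spatial, w_semantic)
            if score > best_score:
                best_score, best_pair = score, (w_spatial, w_semantic)
    return best_pair
```

With the settings reported above, such a sweep on the AffectNet-Yaw validation data would select a spatial weight of 1 and a semantic weight of 5.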
Conclusion
The proposed DCSSL framework in this work can efficiently exploit both frontal and nonfrontal faces, successfully increasing the accuracy of FER under diverse head poses. The new AffectNet-Yaw dataset has a more balanced data distribution, both along the category axis and the head pose axis, enabling a comprehensive study of the impact of head poses on FER. Both of these elements hold substantial value for building robust FER models.
Ariz M, Villanueva A and Cabeza R. 2019. Robust and accurate 2D-tracking-based 3D positioning method: application to head pose estimation. Computer Vision and Image Understanding, 180: 13-22 [DOI: 10.1016/j.cviu.2019.01.002]
Berthelot D, Carlini N, Cubuk E D, Kurakin A, Sohn K, Zhang H and Raffel C. 2020. ReMixMatch: semi-supervised learning with distribution alignment and augmentation anchoring [EB/OL]. [2024-01-03]. https://arxiv.org/pdf/1911.09785.pdf
Cai J, Meng Z B, Khan A S, Li Z Y, O’Reilly J and Tong Y. 2023. Probabilistic attribute tree structured convolutional neural networks for facial expression recognition in the wild. IEEE Transactions on Affective Computing, 14(3): 1927-1941 [DOI: 10.1109/TAFFC.2022.3156920]
Chan T H, Jia K, Gao S H, Lu J W, Zeng Z N and Ma Y. 2015. PCANet: a simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12): 5017-5032 [DOI: 10.1109/TIP.2015.2475625]
Cubuk E D, Zoph B, Shlens J and Le Q V. 2020. RandAugment: practical automated data augmentation with a reduced search space//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Seattle, USA: IEEE: 3008-3017 [DOI: 10.1109/CVPRW50498.2020.00359]
Cui X Y, He C, Zhao H K and Wang M L. 2024. Combining ViT with contrastive learning for facial expression recognition. Journal of Image and Graphics, 29(1): 123-133
崔鑫宇, 何翀, 赵宏珂, 王美丽. 2024. 融合ViT与对比学习的面部表情识别. 中国图象图形学报, 29(1): 123-133 [DOI: 10.11834/jig.230043]
Derbali M, Jarrah M and Randhawa P. 2023. Autism spectrum disorder detection: video games based facial expression diagnosis using deep learning. International Journal of Advanced Computer Science and Applications, 14(1): 110-119 [DOI: 10.14569/IJACSA.2023.0140112]
Farzaneh A H and Qi X J. 2021. Facial expression recognition in the wild via deep attentive center loss//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 2401-2410 [DOI: 10.1109/WACV48630.2021.00245]
Guo Y D, Zhang L, Hu Y X, He X D and Gao J F. 2016. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 87-102 [DOI: 10.1007/978-3-319-46487-9_6]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Jaiswal A, Babu A R, Zadeh M Z, Banerjee D and Makedon F. 2021. A survey on contrastive self-supervised learning. Technologies, 9(1): #2 [DOI: 10.3390/technologies9010002]
Jeong M and Ko B C. 2018. Driver’s facial expression recognition in real-time for safe driving. Sensors, 18(12): #4270 [DOI: 10.3390/s18124270]
Kim S, Kim D, Cho M and Kwak S. 2022. Self-taught metric learning without labels//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 7421-7431 [DOI: 10.1109/CVPR52688.2022.00728]
Lai H J, Tang Z H and Zhang X Y. 2023. RepEPnP: weakly supervised 3D human pose estimation with EPnP algorithm//Proceedings of 2023 International Joint Conference on Neural Networks (IJCNN). Gold Coast, Australia: IEEE: 1-8 [DOI: 10.1109/IJCNN54540.2023.10191300]
Le N, Nguyen K, Tran Q, Tjiputra E, Le B and Nguyen A. 2023. Uncertainty-aware label distribution learning for facial expression recognition//Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 6077-6086 [DOI: 10.1109/WACV56688.2023.00603]
Liu H W, Cai H L, Lin Q C, Zhang X W, Li X F and Xiao H. 2023a. FEDA: fine-grained emotion difference analysis for facial expression recognition. Biomedical Signal Processing and Control, 79: #104209 [DOI: 10.1016/j.bspc.2022.104209]
Liu S, Xu Y, Wan T M and Kui X Y. 2023b. A dual-branch adaptive distribution fusion framework for real-world facial expression recognition//Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece: IEEE: 1-5 [DOI: 10.1109/ICASSP49357.2023.10097033]
Liu X, Zhang F J, Hou Z Y, Mian L, Wang Z Y, Zhang J and Tang J. 2023c. Self-supervised learning: generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1): 857-876 [DOI: 10.1109/TKDE.2021.3090866]
Lopes A T, De Aguiar E, De Souza A F and Oliveira-Santos T. 2017. Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recognition, 61: 610-628 [DOI: 10.1016/j.patcog.2016.07.026]
Loshchilov I and Hutter F. 2017. SGDR: stochastic gradient descent with warm restarts [EB/OL]. [2024-01-03]. https://arxiv.org/pdf/1608.03983v3.pdf
Ma F, Li Y, Ni S G, Huang S L and Zhang L. 2022. Data augmentation for audio-visual emotion recognition with an efficient multimodal conditional GAN. Applied Sciences, 12(1): #527 [DOI: 10.3390/app12010527]
Makhmudkhujaev F, Abdullah-Al-Wadud M, Iqbal M T B, Ryu B and Chae O. 2019. Facial expression recognition with local prominent directional pattern. Signal Processing: Image Communication, 74: 1-12 [DOI: 10.1016/j.image.2019.01.002]
Moin A, Aadil F, Ali Z and Kang D. 2023. Emotion recognition framework using multiple modalities for an effective human-computer interaction. The Journal of Supercomputing, 79(8): 9320-9349 [DOI: 10.1007/s11227-022-05026-w]
Mollahosseini A, Hasani B and Mahoor M H. 2019. AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1): 18-31 [DOI: 10.1109/TAFFC.2017.2740923]
Psaroudakis A and Kollias D. 2022. MixAugment and Mixup: augmentation methods for facial expression recognition//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans, USA: IEEE: 2366-2374 [DOI: 10.1109/CVPRW56347.2022.00264]
Raina R, Battle A, Lee H, Packer B and Ng A Y. 2007. Self-taught learning: transfer learning from unlabeled data//Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM: 759-766 [DOI: 10.1145/1273496.1273592]
Rao Q Y, Qu X, Mao Q R and Zhan Y Z. 2015. Multi-pose facial expression recognition based on SURF boosting//Proceedings of 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). Xi’an, China: IEEE: 630-635 [DOI: 10.1109/ACII.2015.7344635]
Robbins H and Monro S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3): 400-407
Ruan D L, Yan Y, Lai S Q, Chai Z H, Shen C H and Wang H Z. 2021. Feature decomposition and reconstruction learning for effective facial expression recognition//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7656-7665 [DOI: 10.1109/CVPR46437.2021.00757]
Rudovic O, Pantic M and Patras I. 2013. Coupled Gaussian processes for pose-invariant facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6): 1357-1369 [DOI: 10.1109/TPAMI.2012.233]
Stoychev S and Gunes H. 2022. The effect of model compression on fairness in facial expression recognition [EB/OL]. [2024-01-03]. https://arxiv.org/pdf/2201.01709.pdf
Vieriu R L, Tulyakov S, Semeniuta S, Sangineto E and Sebe N. 2015. Facial expression recognition under a wide range of head poses//Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). Ljubljana, Slovenia: IEEE: 1-7 [DOI: 10.1109/FG.2015.7163098]
Wang C, Wang S F and Liang G. 2019. Identity- and pose-robust facial expression recognition through adversarial feature learning//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 238-246 [DOI: 10.1145/3343031.3350872]
Wang G T, Li J, Wu Z J, Xu J H, Shen J F and Yang W K. 2023. EfficientFace: an efficient deep network with feature enhancement for accurate face detection. Multimedia Systems, 29(5): 2825-2839 [DOI: 10.1007/s00530-023-01134-6]
Wang H, Nie F P and Huang H. 2013. Robust and discriminative self-taught learning//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: PMLR: 298-306
Wang K, Peng X J, Yang J F, Meng D B and Qiao Y. 2020a. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29: 4057-4069 [DOI: 10.1109/TIP.2019.2956143]
Wang K, Peng X J, Yang J F, Lu S J and Qiao Y. 2020b. Suppressing uncertainties for large-scale facial expression recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6896-6905 [DOI: 10.1109/CVPR42600.2020.00693]
Wen Z Y, Lin W Z, Wang T and Xu G. 2023. Distract your attention: multi-head cross attention network for facial expression recognition. Biomimetics, 8(2): #199 [DOI: 10.3390/biomimetics8020199]
Yan K Y, Zheng W M, Zhang T, Zong Y, Tang C G, Lu C and Cui Z. 2019. Cross-domain facial expression recognition based on transductive deep transfer learning. IEEE Access, 7: 108906-108915 [DOI: 10.1109/ACCESS.2019.2930359]
Yang X P, Yang H N, Li J T and Wang S J. 2023. Simple but effective in-the-wild micro-expression spotting based on head pose segmentation//Proceedings of the 3rd Workshop on Facial Micro-Expression: Advanced Techniques for Multi-Modal Facial Expression Analysis. Ottawa, Canada: ACM: 9-16 [DOI: 10.1145/3607829.3616445]
Zhang F F, Zhang T Z, Mao Q R and Xu C S. 2018. Joint pose and expression modeling for facial expression recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3359-3368 [DOI: 10.1109/CVPR.2018.00354]
Zhang S N, Zhang Y H, Zhang Y, Wang Y F and Song Z G. 2023. A dual-direction attention mixed feature network for facial expression recognition. Electronics, 12(17): #3595 [DOI: 10.3390/electronics12173595]
Zhang Y H, Wang C R, Ling X and Deng W H. 2022. Learn from all: erasing attention consistency for noisy label facial expression recognition//Proceedings of the 17th European Conference on Computer Vision (ECCV). Tel Aviv, Israel: Springer: 418-434 [DOI: 10.1007/978-3-031-19809-0_24]
Zhao M H, Dong S S, Hu J, Du S L, Shi C, Li P and Shi Z H. 2024. Attention-guided three-stream convolutional neural network for microexpression recognition. Journal of Image and Graphics, 29(1): 111-122
赵明华, 董爽爽, 胡静, 都双丽, 石程, 李鹏, 石争浩. 2024. 注意力引导的三流卷积神经网络用于微表情识别. 中国图象图形学报, 29(1): 111-122 [DOI: 10.11834/jig.230053]
Zhao Z Q, Liu Q S and Wang S M. 2021. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing, 30: 6544-6556 [DOI: 10.1109/TIP.2021.3093397]
Zheng C, Mendieta M and Chen C. 2023. POSTER: a pyramid cross-fusion transformer network for facial expression recognition//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Paris, France: IEEE: 3138-3147 [DOI: 10.1109/ICCVW60793.2023.00339]