Attention-guided local feature joint learning for facial expression recognition
2024, Vol. 29, No. 8, Pages: 2377-2387
Print publication date: 2024-08-16
DOI: 10.11834/jig.230410
Lu Lidan, Xia Haiying, Tan Yumei, Song Shuxiang. 2024. Attention-guided local feature joint learning for facial expression recognition. Journal of Image and Graphics, 29(08): 2377-2387
Objective
In complex natural scenes, facial expression recognition suffers from partial occlusions caused by glasses, hand movements, hairstyles, and the like, and these occluded regions weaken a model's ability to discriminate emotions. We therefore propose an attention-guided local feature joint learning method for facial expression recognition.
Method
The method consists of a global feature extraction module, a global feature enhancement module, and a local feature joint learning module. The global feature extraction module extracts middle-layer global features. The global feature enhancement module suppresses the redundant features introduced by the pretrained face recognition model and enhances the semantic information of the feature maps most relevant to emotion in the global face image. The local feature joint learning module uses a mixed attention mechanism to learn fine-grained salient features of different local facial regions, constrained by a joint loss.
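The channel enhancement step can be made concrete with a short sketch. The following PyTorch block is a minimal, hedged rendering of an ECA-style channel attention module of the kind the enhancement module uses (Wang et al., 2020b); the class and variable names are illustrative, not the authors' released code.

```python
import math
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA-style channel attention: pool, 1D conv across channels, sigmoid gate."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapts to the channel count, as in the ECA paper.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):                                 # x: (N, C, H, W)
        y = self.pool(x)                                  # (N, C, 1, 1) descriptor
        y = y.squeeze(-1).transpose(-1, -2)               # (N, 1, C)
        y = self.conv(y)                                  # local cross-channel interaction
        y = self.gate(y).transpose(-1, -2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return x * y                                      # reweight channels
```

Gating the channels this way strengthens feature maps that help emotion classification and suppresses the redundant ones inherited from face recognition pretraining.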
Result
Experiments are conducted on two natural scene datasets, the real-world affective faces database (RAF-DB) and FERPlus. On RAF-DB the recognition accuracy is 89.24%, a 0.84% improvement over the global multi-scale and local attention network (MA-Net); on FERPlus it is 90.04%, comparable to the FER framework with two attention mechanisms (FER-VT). The experimental results show that the method is robust.
Conclusion
By learning in the order of global enhancement followed by local refinement, the proposed method effectively reduces the interference caused by partial occlusion.
Objective
When communicating face to face, people convey their inner emotions in various ways, such as conversational tone, body movements, and facial expressions. Among these, facial expression is the most direct means of observing human emotion: people express their thoughts and feelings through it, and they also use it to read others' attitudes and inner worlds. Facial expression recognition is therefore one of the main research directions in affective computing, with applications in many fields, such as fatigue driving detection, human-computer interaction, analysis of students' listening state, and intelligent medical services. However, in complex natural scenes, facial expression recognition suffers from direct occlusions such as masks, sunglasses, gestures, hairstyles, and beards, as well as indirect interference such as varying illumination, complex backgrounds, and pose variation. These factors make discriminative features difficult to extract, degrade the final recognition results, and pose great challenges to facial expression recognition in natural scenes. We therefore propose an attention-guided local feature joint learning method for facial expression recognition to reduce the interference of occlusion and pose variation.
Method
Our method is composed of a global feature extraction module, a global feature enhancement module, and a local feature joint learning module. First, we use ResNet-50 as the backbone network and initialize its parameters by pretraining on the MS-Celeb-1M face recognition dataset. The rich information in the face recognition model can complement the contextual information needed for facial expression recognition, especially middle-layer features describing the eyes, nose, and mouth. The global feature extraction module, which consists of a 2D convolutional layer and three bottleneck residual blocks, therefore extracts the global features of the middle layer. Second, most facial expression cues are concentrated in key local regions such as the eyes, nose, and mouth, so expression categories can often be recognized correctly from local key information alone, even when the overall face information is ignored. Because face recognition relies on holistic facial information, the pretrained face recognition model introduces features that are unimportant for expression recognition. We therefore use a global feature enhancement module to suppress the redundant features brought in by the pretrained model (e.g., features of the nose region) and to enhance the semantic information of the global face image that is most relevant to emotion. This module is implemented with the efficient channel attention (ECA) mechanism, which uses cross-channel interactions among high-level semantic channel features to strengthen the channels that contribute to classification and weaken the weights of those that harm it. Finally, we uniformly divide the output features of the global feature enhancement module into four non-overlapping local regions along the spatial dimensions. For most face images, this partition places the eye and mouth regions in separate sub-image blocks, splitting the global expression analysis problem into several local computations. The fine-grained salient features of the different local regions are then learned through the mixed attention mechanism, so the local feature joint learning module learns complementary contextual information, which reduces the negative effects of occlusion and pose variation. Because our method trains four classifiers for local feature learning, a decision-level fusion strategy produces the final prediction: the output probabilities of the four classifiers are summed, and the category with the maximum probability is the predicted category.
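To make the partition and fusion concrete, the following is a hedged PyTorch sketch of a four-branch local head under stated assumptions: the mixed attention is stood in for by a CBAM-style channel-plus-spatial block (Woo et al., 2018), the class count defaults to 7 as in RAF-DB, and all names are hypothetical rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttention(nn.Module):
    """Stand-in mixed attention: CBAM-style channel gating, then spatial gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c = x.shape[:2]
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))
                             + self.mlp(x.amax(dim=(2, 3))))
        x = x * gate.view(n, c, 1, 1)                      # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention

class LocalJointHead(nn.Module):
    """Four non-overlapping spatial blocks, each with attention and a classifier."""
    def __init__(self, channels: int, num_classes: int = 7):
        super().__init__()
        self.attns = nn.ModuleList(MixedAttention(channels) for _ in range(4))
        self.clfs = nn.ModuleList(nn.Linear(channels, num_classes) for _ in range(4))

    def forward(self, feat):                               # enhanced features (N, C, H, W)
        h, w = feat.shape[2] // 2, feat.shape[3] // 2
        blocks = [feat[:, :, :h, :w], feat[:, :, :h, w:],  # 2 x 2 split keeps the eye
                  feat[:, :, h:, :w], feat[:, :, h:, w:]]  # and mouth regions apart
        logits = [clf(attn(b).mean(dim=(2, 3)))            # pooled per-block feature
                  for b, attn, clf in zip(blocks, self.attns, self.clfs)]
        probs = sum(F.softmax(l, dim=1) for l in logits)   # decision-level fusion
        return logits, probs.argmax(dim=1)                 # logits feed the joint loss
```

During training, each of the four per-branch logits would be constrained by the joint loss; at inference, summing the four softmax outputs and taking the argmax realizes the decision-level fusion described above.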
Result
Relevant experiments are conducted on two in-the-wild expression datasets, the real-world affective faces database (RAF-DB) and face expression recognition plus (FERPlus). Ablation results show that our method gains 1.89% and 2.47% over the base model on the two datasets, respectively. On RAF-DB, the recognition accuracy is 89.24%, a 0.84% improvement over the global multi-scale and local attention network (MA-Net). On FERPlus, the recognition accuracy is 90.04%, comparable to that of the FER framework with two attention mechanisms (FER-VT). Our method is therefore robust. Testing the model trained on RAF-DB on FED-RO, a dataset with real occlusions, yields an accuracy of 67.60%. We also use Grad-CAM++ to visualize the attention heatmaps of the proposed model, demonstrating its effectiveness more intuitively. The visualization of the local feature joint learning module shows that it directs the overall model to focus on the features in each individual local image block that are useful for classification.
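The heatmap step can be reproduced with a short sketch that assumes the open-source pytorch-grad-cam package and uses a plain ResNet-50 with a random input as stand-ins for the trained model and a test face; neither is the authors' released code.

```python
import numpy as np
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

# Stand-in network; a real run would load the trained expression model instead.
model = resnet50(weights=None).eval()
target_layers = [model.layer4[-1]]        # last residual stage of the backbone

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)

# Placeholder aligned face image in [0, 1]; a real run would load and
# normalize an actual test image.
rgb = np.random.rand(224, 224, 3).astype(np.float32)
x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)   # (1, 3, 224, 224)

grayscale_cam = cam(input_tensor=x)[0]    # (224, 224) attention map in [0, 1]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)  # heatmap overlay
```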
Conclusion
In general, the proposed method is guided by the attention mechanism: it first enhances the global features and then learns salient features in local regions. This learning sequence of global enhancement followed by local refinement effectively reduces the interference of the partial occlusion problem. Experiments on two natural scene datasets and an occlusion test set show that the model is simple, effective, and robust.
facial expression recognition; attention mechanism; partial occlusion; local salient feature; joint learning
Barsoum E, Zhang C, Ferrer C C and Zhang Z Y. 2016. Training deep networks for facial expression recognition with crowd-sourced label distribution//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: ACM: 279-283 [DOI: 10.1145/2993148.2993165]
Chattopadhay A, Sarkar A, Howlader P and Balasubramanian V N. 2018. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, USA: IEEE: 839-847 [DOI: 10.1109/WACV.2018.00097]
Chen T, Xing S, Yang W W and Jin J Q. 2022. Spatio-temporal features based human facial expression recognition. Journal of Image and Graphics, 27(7): 2185-2198 [DOI: 10.11834/jig.200782]
Corneanu C A, Simón M O, Cohn J F and Guerrero S E. 2016. Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8): 1548-1568 [DOI: 10.1109/TPAMI.2016.2515606]
Ding H, Zhou P and Chellappa R. 2020. Occlusion-adaptive deep network for robust facial expression recognition//Proceedings of 2020 IEEE International Joint Conference on Biometrics (IJCB). Houston, USA: IEEE: 1-9 [DOI: 10.1109/IJCB48548.2020.9304923]
Ekman P and Friesen W V. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2): 124-129 [DOI: 10.1037/h0030377]
Fan Y R, Lam J C K and Li V O K. 2018. Multi-region ensemble convolutional neural network for facial expression recognition//Proceedings of the 27th International Conference on Artificial Neural Networks. Rhodes, Greece: Springer: 84-94 [DOI: 10.1007/978-3-030-01418-6_9]
Goodfellow I J, Erhan D, Carrier P L, Courville A, Mirza M, Hamner B, Cukierski W, Tang Y C, Thaler D, Lee D H, Zhou Y B, Ramaiah C, Feng F X, Li R F, Wang X J, Athanasakis D, Shawe-Taylor J, Milakov M, Park J, Ionescu R, Popescu M, Grozea C, Bergstra J, Xie J J, Romaszko L, Xu B, Chuang Z and Bengio Y. 2013. Challenges in representation learning: a report on three machine learning contests//Proceedings of the 20th International Conference on Neural Information Processing. Daegu, Korea (South): Springer: 117-124 [DOI: 10.1007/978-3-642-42051-1_16]
Guo Y D, Zhang L, Hu Y X, He X D and Gao J F. 2016. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition//Proceedings of the 14th European Conference on Computer Vision-ECCV 2016. Amsterdam, the Netherlands: Springer: 87-102 [DOI: 10.1007/978-3-319-46487-9_6]
Hazourli A R, Djeghri A, Salam H and Othmani A. 2020. Deep multi-facial patches aggregation network for facial expression recognition [EB/OL]. [2023-06-15]. https://arxiv.org/pdf/2002.09298.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Huang Q H, Huang C Q, Wang X Z and Jiang F. 2021. Facial expression recognition with grid-wise attention and visual Transformer. Information Sciences, 580: 35-54 [DOI: 10.1016/j.ins.2021.08.043]
Li S, Deng W H and Du J P. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2852-2861 [DOI: 10.1109/CVPR.2017.277]
Li Y, Zeng J B, Shan S G and Chen X L. 2019. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing, 28(5): 2439-2450 [DOI: 10.1109/TIP.2018.2886767]
Liang H G, Bo Y, Lei Y X, Yu Z X and Liu L H. 2022. A CNN-improved and channel-weighted lightweight human facial expression recognition method. Journal of Image and Graphics, 27(12): 3491-3502 [DOI: 10.11834/jig.210945]
Lucey P, Cohn J F, Kanade T, Saragih J, Ambadar Z and Matthews I. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. San Francisco, USA: IEEE: 94-101 [DOI: 10.1109/CVPRW.2010.5543262]
Ma F Y, Sun B and Li S T. 2023. Facial expression recognition with visual Transformers and attentional selective fusion. IEEE Transactions on Affective Computing, 14(2): 1236-1248 [DOI: 10.1109/TAFFC.2021.3122146]
Mao J Y, He T N, Guo Y and Li A B. 2022. Expression recognition based on global attention and pyramidal convolution network. Computer Engineering and Applications, 58(23): 214-220 [DOI: 10.3778/j.issn.1002-8331.2105-0422]
Mollahosseini A, Hasani B and Mahoor M H. 2019. AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1): 18-31 [DOI: 10.1109/TAFFC.2017.2740923]
Pantic M, Valstar M, Rademaker R and Maat L. 2005. Web-based database for facial expression analysis//Proceedings of 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, the Netherlands: IEEE: #5 [DOI: 10.1109/ICME.2005.1521424]
Pratama B G, Ardiyanto I and Adji T B. 2017. A review on driver drowsiness based on image, bio-signal, and driver behavior//Proceedings of the 3rd International Conference on Science and Technology-Computer (ICST). Yogyakarta, Indonesia: IEEE: 70-75 [DOI: 10.1109/ICSTC.2017.8011855]
Rehman S, Raza S J, Stegemann A P, Zeeck K, Din R, Llewellyn A, Dio L, Trznadel M, Seo Y W, Chowriappa A J, Kesavadas T, Ahmed K and Guru K A. 2013. Simulation-based robot-assisted surgical training: a health economic evaluation. International Journal of Surgery, 11(9): 841-846 [DOI: 10.1016/j.ijsu.2013.08.006]
Sawyer R, Smith A, Rowe J, Azevedo R and Lester J. 2017. Enhancing student models in game-based learning with facial expression recognition//Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. Bratislava, Slovakia: ACM: 192-201 [DOI: 10.1145/3079628.3079686]
Su C, Wei J G, Lin D Y and Kong L H. 2023. Using attention LSGB network for facial expression recognition. Pattern Analysis and Applications, 26(2): 543-553 [DOI: 10.1007/s10044-022-01124-w]
Wang K, Peng X J, Yang J F, Meng D B and Qiao Y. 2020a. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29: 4057-4069 [DOI: 10.1109/TIP.2019.2956143]
Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020b. ECA-Net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 11534-11542 [DOI: 10.1109/CVPR42600.2020.01155]
Wen Z Y, Lin W Z, Wang T and Xu G. 2023. Distract your attention: multi-head cross attention network for facial expression recognition. Biomimetics, 8(2): #199 [DOI: 10.3390/biomimetics8020199]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision-ECCV 2018. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Yang H Y, Ciftci U and Yin L J. 2018. Facial expression recognition by de-expression residue learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2168-2177 [DOI: 10.1109/CVPR.2018.00231]
Yao L S, He S X, Su K and Shao Q T. 2022. Facial expression recognition based on spatial and channel attention mechanisms. Wireless Personal Communications, 125(2): 1483-1500 [DOI: 10.1007/s11277-022-09616-y]
Zhang K H, Huang Y Z, Du Y and Wang L. 2017. Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Transactions on Image Processing, 26(9): 4193-4203 [DOI: 10.1109/TIP.2017.2689999]
Zhao G Y, Huang X H, Taini M, Li S Z and Pietikäinen M. 2011. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9): 607-619 [DOI: 10.1016/j.imavis.2011.07.002]
Zhao Z Q, Liu Q S and Wang S M. 2021. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing, 30: 6544-6556 [DOI: 10.1109/TIP.2021.3093397]