Attention-guided local feature joint learning for facial expression recognition
2024, Vol. 29, No. 8, Pages: 2377-2387
Print publication date: 2024-08-16
DOI: 10.11834/jig.230410
Lu Lidan, Xia Haiying, Tan Yumei, Song Shuxiang. 2024. Attention-guided local feature joint learning for facial expression recognition. Journal of Image and Graphics, 29(08): 2377-2387
Objective
In complex natural scenes, facial expression recognition suffers from partial occlusions caused by glasses, hand movements, hairstyles, and the like, and these occluded regions weaken a model's ability to discriminate emotions. We therefore propose an attention-guided local feature joint learning method for facial expression recognition.
Method
The method consists of a global feature extraction module, a global feature enhancement module, and a local feature joint learning module. The global feature extraction module extracts middle-layer global features. The global feature enhancement module suppresses the redundant features introduced by the pretrained face recognition model and enhances the semantic information of the feature maps most relevant to emotion in the global face image. The local feature joint learning module uses a mixed attention mechanism to learn fine-grained salient features of different local facial regions, constrained by a joint loss.
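The channel enhancement step can be made concrete with a short sketch. The following PyTorch block is a minimal, hedged rendering of an ECA-style channel attention module of the kind the enhancement module uses (Wang et al., 2020b); the class and variable names are illustrative, not the authors' released code.

```python
import math
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA-style channel attention: pool, 1D conv across channels, sigmoid gate."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapts to the channel count, as in the ECA paper.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):                                 # x: (N, C, H, W)
        y = self.pool(x)                                  # (N, C, 1, 1) descriptor
        y = y.squeeze(-1).transpose(-1, -2)               # (N, 1, C)
        y = self.conv(y)                                  # local cross-channel interaction
        y = self.gate(y).transpose(-1, -2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return x * y                                      # reweight channels
```

Gating the channels this way strengthens feature maps that help emotion classification and suppresses the redundant ones inherited from face recognition pretraining.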
Result
Experiments are conducted on two natural scene datasets, the real-world affective faces database (RAF-DB) and FERPlus. On RAF-DB the recognition accuracy is 89.24%, a 0.84% improvement over the global multi-scale and local attention network (MA-Net); on FERPlus it is 90.04%, comparable to the FER framework with two attention mechanisms (FER-VT). The experimental results show that the method is robust.
Conclusion
By learning in the order of global enhancement followed by local refinement, the proposed method effectively reduces the interference caused by partial occlusion.
Objective
When communicating face to face, people convey their inner emotions in various ways, such as conversational tone, body movements, and facial expressions. Among these, facial expression is the most direct means of observing human emotion: people express their thoughts and feelings through it, and they also use it to read others' attitudes and inner worlds. Facial expression recognition is therefore one of the main research directions in affective computing, with applications in many fields, such as fatigue driving detection, human-computer interaction, analysis of students' listening state, and intelligent medical services. However, in complex natural scenes, facial expression recognition suffers from direct occlusions such as masks, sunglasses, gestures, hairstyles, and beards, as well as indirect interference such as varying illumination, complex backgrounds, and pose variation. These factors make discriminative features difficult to extract, degrade the final recognition results, and pose great challenges to facial expression recognition in natural scenes. We therefore propose an attention-guided local feature joint learning method for facial expression recognition to reduce the interference of occlusion and pose variation.
Method
Our method is composed of a global feature extraction module, a global feature enhancement module, and a local feature joint learning module. First, we use ResNet-50 as the backbone network and initialize its parameters by pretraining on the MS-Celeb-1M face recognition dataset. The rich information in the face recognition model can complement the contextual information needed for facial expression recognition, especially middle-layer features describing the eyes, nose, and mouth. The global feature extraction module, which consists of a 2D convolutional layer and three bottleneck residual blocks, therefore extracts the global features of the middle layer. Second, most facial expression cues are concentrated in key local regions such as the eyes, nose, and mouth, so expression categories can often be recognized correctly from local key information alone, even when the overall face information is ignored. Because face recognition relies on holistic facial information, the pretrained face recognition model introduces features that are unimportant for expression recognition. We therefore use a global feature enhancement module to suppress the redundant features brought in by the pretrained model (e.g., features of the nose region) and to enhance the semantic information of the global face image that is most relevant to emotion. This module is implemented with the efficient channel attention (ECA) mechanism, which uses cross-channel interactions among high-level semantic channel features to strengthen the channels that contribute to classification and weaken the weights of those that harm it. Finally, we uniformly divide the output features of the global feature enhancement module into four non-overlapping local regions along the spatial dimensions. For most face images, this partition places the eye and mouth regions in separate sub-image blocks, splitting the global expression analysis problem into several local computations. The fine-grained salient features of the different local regions are then learned through the mixed attention mechanism, so the local feature joint learning module learns complementary contextual information, which reduces the negative effects of occlusion and pose variation. Because our method trains four classifiers for local feature learning, a decision-level fusion strategy produces the final prediction: the output probabilities of the four classifiers are summed, and the category with the maximum probability is the predicted category.
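To make the partition and fusion concrete, the following is a hedged PyTorch sketch of a four-branch local head under stated assumptions: the mixed attention is stood in for by a CBAM-style channel-plus-spatial block (Woo et al., 2018), the class count defaults to 7 as in RAF-DB, and all names are hypothetical rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttention(nn.Module):
    """Stand-in mixed attention: CBAM-style channel gating, then spatial gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c = x.shape[:2]
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))
                             + self.mlp(x.amax(dim=(2, 3))))
        x = x * gate.view(n, c, 1, 1)                      # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention

class LocalJointHead(nn.Module):
    """Four non-overlapping spatial blocks, each with attention and a classifier."""
    def __init__(self, channels: int, num_classes: int = 7):
        super().__init__()
        self.attns = nn.ModuleList(MixedAttention(channels) for _ in range(4))
        self.clfs = nn.ModuleList(nn.Linear(channels, num_classes) for _ in range(4))

    def forward(self, feat):                               # enhanced features (N, C, H, W)
        h, w = feat.shape[2] // 2, feat.shape[3] // 2
        blocks = [feat[:, :, :h, :w], feat[:, :, :h, w:],  # 2 x 2 split keeps the eye
                  feat[:, :, h:, :w], feat[:, :, h:, w:]]  # and mouth regions apart
        logits = [clf(attn(b).mean(dim=(2, 3)))            # pooled per-block feature
                  for b, attn, clf in zip(blocks, self.attns, self.clfs)]
        probs = sum(F.softmax(l, dim=1) for l in logits)   # decision-level fusion
        return logits, probs.argmax(dim=1)                 # logits feed the joint loss
```

During training, each of the four per-branch logits would be constrained by the joint loss; at inference, summing the four softmax outputs and taking the argmax realizes the decision-level fusion described above.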
Result
Relevant experiments are conducted on two in-the-wild expression datasets, the real-world affective faces database (RAF-DB) and face expression recognition plus (FERPlus). Ablation results show that our method gains 1.89% and 2.47% over the base model on the two datasets, respectively. On RAF-DB, the recognition accuracy is 89.24%, a 0.84% improvement over the global multi-scale and local attention network (MA-Net). On FERPlus, the recognition accuracy is 90.04%, comparable to that of the FER framework with two attention mechanisms (FER-VT). Our method is therefore robust. Testing the model trained on RAF-DB on FED-RO, a dataset with real occlusions, yields an accuracy of 67.60%. We also use Grad-CAM++ to visualize the attention heatmaps of the proposed model, demonstrating its effectiveness more intuitively. The visualization of the local feature joint learning module shows that it directs the overall model to focus on the features in each individual local image block that are useful for classification.
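The heatmap step can be reproduced with a short sketch that assumes the open-source pytorch-grad-cam package and uses a plain ResNet-50 with a random input as stand-ins for the trained model and a test face; neither is the authors' released code.

```python
import numpy as np
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

# Stand-in network; a real run would load the trained expression model instead.
model = resnet50(weights=None).eval()
target_layers = [model.layer4[-1]]        # last residual stage of the backbone

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)

# Placeholder aligned face image in [0, 1]; a real run would load and
# normalize an actual test image.
rgb = np.random.rand(224, 224, 3).astype(np.float32)
x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)   # (1, 3, 224, 224)

grayscale_cam = cam(input_tensor=x)[0]    # (224, 224) attention map in [0, 1]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)  # heatmap overlay
```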
Conclusion
In general, the proposed method is guided by the attention mechanism: it first enhances the global features and then learns salient features in local regions. This learning sequence of global enhancement followed by local refinement effectively reduces the interference of the partial occlusion problem. Experiments on two natural scene datasets and an occlusion test set show that the model is simple, effective, and robust.
facial expression recognition; attention mechanism; partial occlusion; local salient feature; joint learning
Barsoum E, Zhang C, Ferrer C C and Zhang Z Y. 2016. Training deep networks for facial expression recognition with crowd-sourced label distribution//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: ACM: 279-283 [DOI: 10.1145/2993148.2993165]
Chattopadhay A, Sarkar A, Howlader P and Balasubramanian V N. 2018. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, USA: IEEE: 839-847 [DOI: 10.1109/WACV.2018.00097]
Chen T, Xing S, Yang W W and Jin J Q. 2022. Spatio-temporal features based human facial expression recognition. Journal of Image and Graphics, 27(7): 2185-2198 [DOI: 10.11834/jig.200782]
Corneanu C A, Simón M O, Cohn J F and Guerrero S E. 2016. Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8): 1548-1568 [DOI: 10.1109/TPAMI.2016.2515606]
Ding H, Zhou P and Chellappa R. 2020. Occlusion-adaptive deep network for robust facial expression recognition//Proceedings of 2020 IEEE International Joint Conference on Biometrics (IJCB). Houston, USA: IEEE: 1-9 [DOI: 10.1109/IJCB48548.2020.9304923]
Ekman P and Friesen W V. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2): 124-129 [DOI: 10.1037/h0030377]
Fan Y R, Lam J C K and Li V O K. 2018. Multi-region ensemble convolutional neural network for facial expression recognition//Proceedings of the 27th International Conference on Artificial Neural Networks. Rhodes, Greece: Springer: 84-94 [DOI: 10.1007/978-3-030-01418-6_9]
Goodfellow I J, Erhan D, Carrier P L, Courville A, Mirza M, Hamner B, Cukierski W, Tang Y C, Thaler D, Lee D H, Zhou Y B, Ramaiah C, Feng F X, Li R F, Wang X J, Athanasakis D, Shawe-Taylor J, Milakov M, Park J, Ionescu R, Popescu M, Grozea C, Bergstra J, Xie J J, Romaszko L, Xu B, Chuang Z and Bengio Y. 2013. Challenges in representation learning: a report on three machine learning contests//Proceedings of the 20th International Conference on Neural Information Processing. Daegu, Korea (South): Springer: 117-124 [DOI: 10.1007/978-3-642-42051-1_16]
Guo Y D, Zhang L, Hu Y X, He X D and Gao J F. 2016. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition//Proceedings of the 14th European Conference on Computer Vision-ECCV 2016. Amsterdam, the Netherlands: Springer: 87-102 [DOI: 10.1007/978-3-319-46487-9_6]
Hazourli A R, Djeghri A, Salam H and Othmani A. 2020. Deep multi-facial patches aggregation network for facial expression recognition [EB/OL]. [2023-06-15]. https://arxiv.org/pdf/2002.09298.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Huang Q H, Huang C Q, Wang X Z and Jiang F. 2021. Facial expression recognition with grid-wise attention and visual Transformer. Information Sciences, 580: 35-54 [DOI: 10.1016/j.ins.2021.08.043]
Li S, Deng W H and Du J P. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 2852-2861 [DOI: 10.1109/CVPR.2017.277]
Li Y, Zeng J B, Shan S G and Chen X L. 2019. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing, 28(5): 2439-2450 [DOI: 10.1109/TIP.2018.2886767]
Liang H G, Bo Y, Lei Y X, Yu Z X and Liu L H. 2022. A CNN-improved and channel-weighted lightweight human facial expression recognition method. Journal of Image and Graphics, 27(12): 3491-3502 [DOI: 10.11834/jig.210945]
Lucey P, Cohn J F, Kanade T, Saragih J, Ambadar Z and Matthews I. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. San Francisco, USA: IEEE: 94-101 [DOI: 10.1109/CVPRW.2010.5543262]
Ma F Y, Sun B and Li S T. 2023. Facial expression recognition with visual Transformers and attentional selective fusion. IEEE Transactions on Affective Computing, 14(2): 1236-1248 [DOI: 10.1109/TAFFC.2021.3122146]
Mao J Y, He T N, Guo Y and Li A B. 2022. Expression recognition based on global attention and pyramidal convolution network. Computer Engineering and Applications, 58(23): 214-220 [DOI: 10.3778/j.issn.1002-8331.2105-0422]
Mollahosseini A, Hasani B and Mahoor M H. 2019. AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1): 18-31 [DOI: 10.1109/TAFFC.2017.2740923]
Pantic M, Valstar M, Rademaker R and Maat L. 2005. Web-based database for facial expression analysis//Proceedings of 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, the Netherlands: IEEE: #5 [DOI: 10.1109/ICME.2005.1521424]
Pratama B G, Ardiyanto I and Adji T B. 2017. A review on driver drowsiness based on image, bio-signal, and driver behavior//Proceedings of the 3rd International Conference on Science and Technology-Computer (ICST). Yogyakarta, Indonesia: IEEE: 70-75 [DOI: 10.1109/ICSTC.2017.8011855]
Rehman S, Raza S J, Stegemann A P, Zeeck K, Din R, Llewellyn A, Dio L, Trznadel M, Seo Y W, Chowriappa A J, Kesavadas T, Ahmed K and Guru K A. 2013. Simulation-based robot-assisted surgical training: a health economic evaluation. International Journal of Surgery, 11(9): 841-846 [DOI: 10.1016/j.ijsu.2013.08.006]
Sawyer R, Smith A, Rowe J, Azevedo R and Lester J. 2017. Enhancing student models in game-based learning with facial expression recognition//Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. Bratislava, Slovakia: ACM: 192-201 [DOI: 10.1145/3079628.3079686]
Su C, Wei J G, Lin D Y and Kong L H. 2023. Using attention LSGB network for facial expression recognition. Pattern Analysis and Applications, 26(2): 543-553 [DOI: 10.1007/s10044-022-01124-w]
Wang K, Peng X J, Yang J F, Meng D B and Qiao Y. 2020a. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29: 4057-4069 [DOI: 10.1109/TIP.2019.2956143]
Wang Q L, Wu B G, Zhu P F, Li P H, Zuo W M and Hu Q H. 2020b. ECA-Net: efficient channel attention for deep convolutional neural networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 11534-11542 [DOI: 10.1109/CVPR42600.2020.01155]
Wen Z Y, Lin W Z, Wang T and Xu G. 2023. Distract your attention: multi-head cross attention network for facial expression recognition. Biomimetics, 8(2): #199 [DOI: 10.3390/biomimetics8020199]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision-ECCV 2018. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Yang H Y, Ciftci U and Yin L J. 2018. Facial expression recognition by de-expression residue learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2168-2177 [DOI: 10.1109/CVPR.2018.00231]
Yao L S, He S X, Su K and Shao Q T. 2022. Facial expression recognition based on spatial and channel attention mechanisms. Wireless Personal Communications, 125(2): 1483-1500 [DOI: 10.1007/s11277-022-09616-y]
Zhang K H, Huang Y Z, Du Y and Wang L. 2017. Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Transactions on Image Processing, 26(9): 4193-4203 [DOI: 10.1109/TIP.2017.2689999]
Zhao G Y, Huang X H, Taini M, Li S Z and Pietikäinen M. 2011. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9): 607-619 [DOI: 10.1016/j.imavis.2011.07.002]
Zhao Z Q, Liu Q S and Wang S M. 2021. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Transactions on Image Processing, 30: 6544-6556 [DOI: 10.1109/TIP.2021.3093397]