Combining ViT with contrastive learning for facial expression recognition
Vol. 29, Issue 1, Pages: 123-133 (2024)
Published: 16 January 2024
DOI: 10.11834/jig.230043
Cui Xinyu, He Chong, Zhao Hongke, Wang Meili. 2024. Combining ViT with contrastive learning for facial expression recognition. Journal of Image and Graphics, 29(01):0123-0133
Objective
Facial expression recognition is one of the important tasks in computer vision, yet its accuracy in real-world environments remains low. To address the reduced recognition accuracy caused by occlusion, pose variation, and illumination changes, this paper proposes a facial expression recognition method based on self-supervised contrastive learning, which improves recognition accuracy under occlusion and other varying conditions.
Method
The method consists of two stages: contrastive learning pretraining and model fine-tuning. In the pretraining stage, the data augmentation scheme and the number of positive/negative sample pair comparisons used in contrastive learning are improved, the Transformer-based vision Transformer (ViT) network is selected as the backbone, and the model is trained on the ImageNet dataset to strengthen its feature extraction capability. In the fine-tuning stage, the pretrained model is fine-tuned on the target facial expression recognition dataset to obtain the recognition results.
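The positive/negative pair comparison at the heart of this pretraining stage can be illustrated with a minimal sketch, assuming the InfoNCE-style loss used by MoCo-family contrastive methods; the tensor names and temperature value below are illustrative, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """query, positive_key: (N, D); negative_keys: (K, D)."""
    query = F.normalize(query, dim=1)
    positive_key = F.normalize(positive_key, dim=1)
    negative_keys = F.normalize(negative_keys, dim=1)
    # Positive logits: similarity between each query and its own positive view.
    l_pos = torch.einsum("nd,nd->n", query, positive_key).unsqueeze(-1)  # (N, 1)
    # Negative logits: similarity against every key in the negative set.
    l_neg = torch.einsum("nd,kd->nk", query, negative_keys)             # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive sits at index 0 of every row, so all labels are 0.
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)

Minimizing this loss pulls each query toward its positive pair and pushes it away from all negatives, which is the reduce-intraclass-distance, increase-interclass-distance behavior that contrastive pretraining relies on.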
Result
Experiments compared the proposed method with 13 methods on four datasets. On the RAF-DB (real-world affective faces database) dataset, recognition accuracy improved by 0.48% over the Face2Exp (combating data biases for facial expression recognition) model; on the FERPlus (facial expression recognition plus) dataset, it improved by 0.35% over the KTN (knowledgeable teacher network) model; on the AffectNet-8 dataset, it improved by 0.40% over the SCN (self-cure network) model; on the AffectNet-7 dataset, it was slightly lower, by 0.26%, than the DACL (deep attentive center loss) model. These results demonstrate the effectiveness of the proposed method.
Conclusion
The proposed facial expression recognition model combines the strengths of the contrastive learning model and the ViT model, improving robustness under occlusion and other challenging conditions and yielding more accurate facial expression recognition results.
Objective
Facial expression is one of the most important factors in human communication, helping people understand the intentions of others. The task of facial expression recognition is to output the expression category corresponding to a given face image. Facial expression recognition has broad applications in areas such as security monitoring, education, and human-computer interaction. Currently, facial expression recognition under uncontrolled conditions suffers from low accuracy due to factors such as pose variation, occlusion, and lighting differences. Addressing these issues would remarkably advance facial expression recognition in real-world scenarios and holds great relevance for the field of artificial intelligence. Self-supervised learning applies specific data augmentations to the input data and generates pseudo labels for training or pretraining models; it leverages large amounts of unlabeled data and extracts the prior knowledge distribution of the images themselves to improve the performance of downstream tasks. Contrastive learning is a form of self-supervised learning that, by increasing the difficulty of the pretext task, can learn the intrinsic, consistent features shared by similar images under changes in pose and lighting. This paper proposes a self-supervised contrastive learning-based facial expression recognition method to address the low accuracy caused by occlusion, pose variation, and lighting changes.
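To make the pseudo-label idea concrete, here is a brief sketch assuming standard torchvision transforms: two independently augmented views of the same face form a positive pair, while views of different faces act as negatives. The specific transforms (including random erasing as a stand-in for occlusion) are illustrative choices, not the paper's exact augmentation recipe.

from torchvision import transforms

# Each call applies a fresh random crop, flip, color jitter, and erasure.
view_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # simulates lighting changes
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),             # simulates partial occlusion
])

def make_views(image):
    # Two independent augmentations of one PIL image form a positive pair;
    # the "pseudo label" is simply the fact that both views share a source.
    return view_transform(image), view_transform(image)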
Method
To address occlusion in facial expression recognition datasets under real-world conditions, a negative-sample-based self-supervised contrastive learning method is employed. The method consists of two stages: contrastive learning pretraining and model fine-tuning. First, in the pretraining stage, an unsupervised contrastive loss is introduced to reduce the distance between images of the same class and increase the distance between images of different classes, improving the model's ability to discriminate among facial expression images that exhibit intraclass diversity and interclass similarity. Positive sample pairs built from original images and their occlusion-augmented counterparts are added to the contrastive learning task, enhancing the robustness of the model to occlusion and illumination changes. Additionally, a dictionary mechanism is applied to MoCo v3 to overcome insufficient memory during training. The recognition model is pretrained on the ImageNet dataset. Next, the model is fine-tuned on the facial expression recognition dataset to improve classification accuracy on the target task. This approach effectively enhances facial expression recognition performance in the presence of occlusion. Moreover, the Transformer-based vision Transformer (ViT) network is employed as the backbone to strengthen the model's feature extraction capability.
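A hedged sketch of how such a pretraining step might be assembled is given below: a ViT query encoder, a momentum ("key") encoder, and a fixed-size dictionary queue of negative keys, in the spirit of MoCo with the dictionary mechanism described above. The timm model name, queue size, momentum, and temperature are placeholder assumptions rather than the paper's configuration.

import torch
import torch.nn.functional as F
import timm  # assumed available; provides ViT backbones

dim, queue_size, momentum = 256, 65536, 0.99

encoder_q = timm.create_model("vit_base_patch16_224", num_classes=dim)
encoder_k = timm.create_model("vit_base_patch16_224", num_classes=dim)
encoder_k.load_state_dict(encoder_q.state_dict())
for p in encoder_k.parameters():
    p.requires_grad = False  # key encoder is updated only by momentum

queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # negative keys

@torch.no_grad()
def momentum_update():
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(momentum).add_(pq, alpha=1 - momentum)

def pretrain_step(view_q, view_k, temperature=0.2):
    # view_q / view_k: two augmented views of the same batch of faces,
    # one of which may carry the occlusion augmentation.
    global queue
    q = F.normalize(encoder_q(view_q), dim=1)      # (N, dim) queries
    with torch.no_grad():
        momentum_update()
        k = F.normalize(encoder_k(view_k), dim=1)  # (N, dim) positive keys
    l_pos = (q * k).sum(dim=1, keepdim=True)       # (N, 1) positive logits
    l_neg = q @ queue.t()                          # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)
    loss = F.cross_entropy(logits, labels)
    # FIFO dictionary: enqueue the newest keys, drop the oldest.
    queue = torch.cat([k, queue], dim=0)[:queue_size]
    return loss

The queue lets the number of negatives per comparison far exceed the batch size, which is how a dictionary mechanism works around limited GPU memory during training.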
Result
Experiments on four datasets compared the proposed method with 13 recent methods. On the RAF-DB dataset, recognition accuracy increased by 0.48% over the Face2Exp model; on the FERPlus dataset, it increased by 0.35% over the knowledgeable teacher network (KTN) model; on the AffectNet-8 dataset, it increased by 0.40% over the self-cure network (SCN) model; on the AffectNet-7 dataset, it was slightly lower, by 0.26%, than the deep attentive center loss (DACL) model. These results demonstrate the effectiveness of the proposed method.
Conclusion
A self-supervised contrastive learning-based facial expression recognition method is proposed to address the challenges of occlusion, pose variation, and illumination changes under uncontrolled conditions. The method consists of two stages: pretraining and fine-tuning. The contribution of this paper lies in integrating ViT into the contrastive learning framework, which enables a large amount of unlabeled, noise-occluded data to be used to learn the distribution characteristics of facial expression data. The proposed method achieves promising accuracy on facial expression recognition datasets, including RAF-DB, FERPlus, AffectNet-7, and AffectNet-8. By leveraging the contrastive learning framework and an advanced feature extraction network, this work advances the application of deep learning methods to everyday visual tasks.
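As a complement, a minimal fine-tuning sketch for the second stage, assuming the contrastively pretrained ViT weights are saved at a hypothetical path pretrained_vit.pth and the target dataset has seven expression classes (as in AffectNet-7):

import torch
import torch.nn as nn
import timm

model = timm.create_model("vit_base_patch16_224", num_classes=7)
state = torch.load("pretrained_vit.pth", map_location="cpu")
# Drop the contrastive projection head; its shape does not match the
# new classification head, which is trained from scratch.
state = {k: v for k, v in state.items() if not k.startswith("head")}
model.load_state_dict(state, strict=False)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def finetune_step(images, labels):
    # One supervised step on labeled facial expression data.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()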
facial expression recognition; contrastive learning; self-supervised learning; Transformer; positive and negative sample pairs
Barsoum E, Zhang C, Ferrer C C and Zhang Z Y. 2016. Training deep networks for facial expression recognition with crowd-sourced label distribution//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: ACM: 279-283 [DOI: 10.1145/2993148.2993165]
Caron M, Misra I, Mairal J, Goyal P, Bojanowski P and Joulin A. 2020. Unsupervised learning of visual features by contrasting cluster assignments//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 9912-9924
Chen T, Kornblith S, Norouzi M and Hinton G E. 2020a. A simple framework for contrastive learning of visual representations//Proceedings of the 37th International Conference on Machine Learning. Virtual Event: JMLR.org: 1597-1607
Chen X L, Fan H Q, Girshick R and He K M. 2020b. Improved baselines with momentum contrastive learning [EB/OL]. [2023-01-29]. https://arxiv.org/pdf/2003.04297.pdf
Chen X L, Xie S N and He K M. 2021. An empirical study of training self-supervised vision Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9620-9629 [DOI: 10.1109/ICCV48922.2021.00950]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-01-29]. https://arxiv.org/pdf/2010.11929.pdf
Ekman P and Friesen W V. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2): 124-129 [DOI: 10.1037/h0030377]
Ekman P and Friesen W V. 1978. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Palo Alto, USA: Consulting Psychologists Press
Fard A P and Mahoor M H. 2022. Ad-Corre: adaptive correlation-based loss for facial expression recognition in the wild. IEEE Access, 10: 26756-26768 [DOI: 10.1109/ACCESS.2022.3156598]
Farzaneh A H and Qi X J. 2020. Discriminant distribution-agnostic loss for facial expression recognition in the wild//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Seattle, USA: IEEE: 1631-1639 [DOI: 10.1109/CVPRW50498.2020.00211]
Farzaneh A H and Qi X J. 2021. Facial expression recognition in the wild via deep attentive center loss//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 2401-2410 [DOI: 10.1109/WACV48630.2021.00245]
Grill J B, Strub F, Altché F, Tallec C, Richemond P H, Buchatskaya E, Doersch C, Pires B A, Guo Z D, Azar M G, Piot B, Kavukcuoglu K, Munos R and Valko M. 2020. Bootstrap your own latent: a new approach to self-supervised learning//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 21271-21284
He K M, Fan H Q, Wu Y X, Xie S N and Girshick R. 2020. Momentum contrast for unsupervised visual representation learning//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9726-9735 [DOI: 10.1109/CVPR42600.2020.00975]
Huang Q H, Huang C Q, Wang X Z and Jiang F. 2021. Facial expression recognition with grid-wise attention and visual Transformer. Information Sciences, 580: 35-54 [DOI: 10.1016/j.ins.2021.08.043]
Li H Y, Wang N N, Ding X P, Yang X and Gao X B. 2021. Adaptively learning facial expression representation via C-F labels and distillation. IEEE Transactions on Image Processing, 30: 2016-2028 [DOI: 10.1109/TIP.2021.3049955]
Li S and Deng W H. 2020. Deep facial expression recognition: a survey. Journal of Image and Graphics, 25(11): 2306-2320 [DOI: 10.11834/jig.200233]
Li S, Deng W H and Du J P. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2584-2593 [DOI: 10.1109/CVPR.2017.277]
Li Y, Zeng J B, Shan S G and Chen X L. 2019a. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing, 28(5): 2439-2450 [DOI: 10.1109/TIP.2018.2886767]
Li Y J, Lu G M, Li J X, Zhang Z and Zhang D. 2020. Facial expression recognition in the wild using multi-level features and attention mechanisms. IEEE Transactions on Affective Computing, 14(1): 451-462 [DOI: 10.1109/TAFFC.2020.3031602]
Li Y J, Lu Y, Li J X and Lu G M. 2019b. Separate loss for basic and compound facial expression recognition in the wild//Proceedings of the 11th Asian Conference on Machine Learning. Nagoya, Japan: PMLR: 897-911
Mollahosseini A, Hasani B and Mahoor M H. 2019. AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1): 18-31 [DOI: 10.1109/TAFFC.2017.2740923]
Peng X J and Qiao Y. 2020. Advances and challenges in facial expression analysis. Journal of Image and Graphics, 25(11): 2337-2348 [DOI: 10.11834/jig.200308]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang K, Peng X J, Yang J F, Lu S J and Qiao Y. 2020a. Suppressing uncertainties for large-scale facial expression recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6896-6905 [DOI: 10.1109/CVPR42600.2020.00693]
Wang K, Peng X J, Yang J F, Meng D B and Qiao Y. 2020b. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29: 4057-4069 [DOI: 10.1109/TIP.2019.2956143]
Yao H X, Deng W H, Liu H H, Hong X N, Wang S J, Yang J F and Zhao S C. 2022. An overview of research development of affective computing and understanding. Journal of Image and Graphics, 27(6): 2008-2035 [DOI: 10.11834/jig.220085]
Zbontar J, Jing L, Misra I, LeCun Y and Deny S. 2021. Barlow twins: self-supervised learning via redundancy reduction//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR: 12310-12320
Zeng D, Lin Z Y, Yan X, Liu Y T, Wang F and Tang B. 2022. Face2Exp: combating data biases for facial expression recognition//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 20259-20268 [DOI: 10.1109/CVPR52688.2022.01965]
Zhang Y H, Wang C R, Ling X and Deng W H. 2022. Learn from all: erasing attention consistency for noisy label facial expression recognition//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 418-434 [DOI: 10.1007/978-3-031-19809-0_24]
Zhao Z Q, Liu Q S and Zhou F. 2021. Robust lightweight facial expression recognition network with label distribution training. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4): 3510-3519 [DOI: 10.1609/aaai.v35i4.16465]