Spatiotemporal attention network for microexpression recognition
2020, Vol. 25, No. 11, Pages 2380-2390
Print publication date: 2020-11-16
Accepted: 2020-09-14
DOI: 10.11834/jig.200325
Guohao Li, Yifan Yuan, Xianye Ben, Junping Zhang. Spatiotemporal attention network for microexpression recognition[J]. Journal of Image and Graphics, 2020,25(11):2380-2390.
Objective
Microexpressions are spontaneous facial muscle movements that can reveal the genuine emotions a person tries to conceal, with potential applications in security, suspect interrogation, and psychological testing. To alleviate the low recognition accuracy caused by the small amplitude and short duration of microexpression muscle movements, this paper proposes a spatiotemporal attention network (STANet) for microexpression recognition.
Method
STANet contains a spatial attention module and a temporal attention module. The spatial attention module first concentrates the model's attention on the regions where the microexpression intensity is larger; the temporal attention module then assigns larger weights to the frames with greater microexpression variation, which are therefore more discriminative.
Result
On three public microexpression datasets (the Chinese Academy of Sciences microexpression (CASME) dataset, CASME II, and the spontaneous microexpression database-high speed camera (SMIC-HS) dataset), STANet was compared with eight other algorithms under leave-one-subject-out cross validation. The results show that on CASME, the classification accuracy of STANet is 1.78% higher than that of the second-best model, Sparse MDMO (sparse main directional mean optical flow); on CASME II, its classification accuracy is 1.90% higher than that of the second-best model, HIGO (histogram of image gradient orientation); and on SMIC-HS, its classification accuracy reaches 68.90%.
Conclusion
Considering the small muscle amplitude, small active regions, and short duration of microexpressions, this paper applies the attention mechanism to the microexpression recognition task and proposes the STANet model, which concentrates attention on the regions with larger microexpression amplitude and on the clips with greater variation between adjacent frames.
Objective
Microexpression, a kind of spontaneous facial muscle movement, can reveal the genuine underlying emotions that people attempt to conceal. Microexpression has potential applications in security, police interrogation, and psychological testing. Compared with macroexpressions, microexpressions have lower intensity and shorter duration, which makes them more difficult to recognize. Traditional methods can be divided into facial image-based and optical flow-based approaches. Facial image-based methods utilize spatiotemporal partition blocks to construct feature vectors, where the spatiotemporal segmentation parameters are regarded as hyperparameters and every sample in a dataset uses the same hyperparameters. Recognition performance may therefore suffer when the same spatiotemporal division blocks are used for different samples, which may require varying spatiotemporal segmentation blocks. Optical flow-based methods are widely used for microexpression recognition. Although such methods demonstrate satisfactory robustness to illumination variation, they treat facial features in different regions as equally important and thus ignore the fact that microexpressions appear only in partial regions. The attention mechanism, which has been introduced in many fields such as natural language processing and computer vision, can focus on the salient regions of an object and give additional weights to these regions. Given its outstanding performance in recognition tasks, we apply the attention mechanism to the microexpression recognition task and propose a spatiotemporal attention network (STANet).
Method
STANet mainly consists of two modules: a spatial attention module (SAM) and a temporal attention module (TAM). SAM is used to focus on the microexpression regions with high intensity, while TAM is incorporated to learn discriminative frames, which are given additional weights. Inspired by the fully convolutional network (FCN), which was proposed for semantic segmentation, we propose a spatial attention branch (SAB) in the SAM. SAB, a top-down and bottom-up structure, is the crucial component of SAM. In the downsampling process, convolutional layers and nonlinear transformations followed by maximum pooling are used to extract the salient features of the microexpression; the maximum pooling operation reduces the resolution and increases the receptive field of the feature map. In the upsampling process, we use bilinear interpolation to gradually recover the feature map to its original size and adopt skip connections to retain detailed information that may otherwise be lost. A sigmoid function is ultimately applied after the last layer to normalize the SAB output to [0, 1].
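The following is a minimal PyTorch sketch of such a top-down and bottom-up attention branch. The channel width, depth, pooling schedule, and the final reweighting of the input feature map are our assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionBranch(nn.Module):
    # Top-down: conv + max pooling shrink the map and enlarge the receptive
    # field; bottom-up: bilinear upsampling restores resolution, with skip
    # connections preserving detail; a sigmoid yields a mask in [0, 1].
    def __init__(self, channels=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.mask_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        f1 = self.down1(x)                    # full-resolution features
        f2 = self.down2(F.max_pool2d(f1, 2))  # 1/2 resolution
        f3 = F.max_pool2d(f2, 2)              # 1/4 resolution
        u2 = F.interpolate(f3, scale_factor=2, mode='bilinear',
                           align_corners=False) + f2   # skip connection
        u1 = F.interpolate(u2, scale_factor=2, mode='bilinear',
                           align_corners=False) + f1   # skip connection
        mask = torch.sigmoid(self.mask_conv(u1))       # attention mask in [0, 1]
        return x * mask                       # reweight salient facial regions

feats = torch.randn(4, 32, 48, 48)            # a batch of 4 feature maps
print(SpatialAttentionBranch(32)(feats).shape)  # torch.Size([4, 32, 48, 48])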
Furthermore, we propose a temporal attention branch (TAB) to focus on the more discriminative frames in the microexpression sequence, which are crucial for microexpression recognition.
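A frame-weighting branch of this kind could look like the sketch below, which scores per-frame feature vectors and pools them with softmax weights over time; the scoring network and the pooling step are assumptions for illustration, since the exact TAB design is not specified in this abstract.

import torch
import torch.nn as nn

class TemporalAttentionBranch(nn.Module):
    # Score each frame's feature vector, normalize the scores over time with
    # a softmax, and pool the sequence with the resulting frame weights.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 64),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(64, 1))

    def forward(self, seq):  # seq: (batch, time, feat_dim)
        w = torch.softmax(self.score(seq).squeeze(-1), dim=1)  # (batch, time)
        return (seq * w.unsqueeze(-1)).sum(dim=1)  # weighted pooling over time

pooled = TemporalAttentionBranch(128)(torch.randn(4, 20, 128))  # 20-frame clips
print(pooled.shape)  # torch.Size([4, 128])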
Experiments are conducted on the Chinese Academy of Sciences microexpression (CASME), the Chinese Academy of Sciences microexpression II (CASME II), and spontaneous microexpression database-high speed camera (SMIC-HS) datasets, with 171, 246, and 164 samples, respectively. Corner-crop and rescaling augmentations are used on CASME and CASME II to avoid overfitting, with scaling factors set to 0.9, 1.0, and 1.1. Corner-crop and horizontal-flip augmentations are applied to SMIC-HS. Because different samples contain different numbers of frames, linear interpolation is used to interpolate every sample to 20 frames. Samples are then resized to 192×192 pixels.
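As a concrete illustration of this preprocessing, the sketch below interpolates a clip to 20 frames along the time axis and resizes the frames to 192×192 pixels; implementing the temporal step with torch's 1-D linear interpolation is our choice of implementation, not necessarily the authors'.

import torch
import torch.nn.functional as F

def normalize_clip(clip, num_frames=20, size=192):
    # clip: (T, C, H, W) -> (num_frames, C, size, size)
    t, c, h, w = clip.shape
    # Temporal step: treat every pixel trajectory as a 1-D signal of length T
    # and linearly interpolate it to num_frames samples.
    x = clip.permute(1, 2, 3, 0).reshape(1, c * h * w, t)
    x = F.interpolate(x, size=num_frames, mode='linear', align_corners=True)
    clip = x.reshape(c, h, w, num_frames).permute(3, 0, 1, 2)
    # Spatial step: resize every frame to size x size pixels.
    return F.interpolate(clip, size=(size, size), mode='bilinear',
                         align_corners=False)

clip = torch.rand(13, 3, 240, 240)     # a 13-frame RGB sample
print(normalize_clip(clip).shape)      # torch.Size([20, 3, 192, 192])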
Finally, we use FlowNet 2.0 to obtain the optical flow sequence of each sample. The experiments use the Adam optimizer with a learning rate of 1E-5; the weight decay coefficient is set to 1E-4, and the coefficient $\lambda$ of the $\ell_1$ regularization term is set to 1E-8. The number of iterations is 60, 30, and 100 for CASME, CASME II, and SMIC-HS, respectively.
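The reported optimization settings can be set up roughly as follows; model and loader are hypothetical stand-ins, and adding the $\ell_1$ penalty directly to the loss is one common way to realize such a regularization term.

import torch
import torch.nn as nn

model = nn.Linear(128, 4)   # hypothetical stand-in for STANet (4 classes)
loader = [(torch.randn(8, 128), torch.randint(0, 4, (8,)))]  # stand-in data

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
lam = 1e-8                  # coefficient of the l1 regularization term

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    # Add the l1 penalty on the parameters to the classification loss.
    loss = loss + lam * sum(p.abs().sum() for p in model.parameters())
    loss.backward()
    optimizer.step()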
Result
We compared our model with eight state-of-the-art frameworks, including facial image-based and optical flow-based methods, on three public microexpression datasets, namely, CASME, CASME II, and SMIC-HS. Leave-one-subject-out (LOSO) cross validation is used because of the insufficient number of samples, and classification accuracy is used to measure the performance of the methods. The results show that our model achieves the best performance on the CASME and CASME II datasets. The classification accuracy of our model on CASME is 1.78% higher than that of Sparse MDMO, which ranks second, and the classification accuracy of STANet on CASME II is 1.90% higher than that of the histogram of image gradient orientation (HIGO) method. On SMIC-HS, the classification accuracy of our model is 68.90%. Ablation studies performed on the CASME dataset verify the validity of SAM and TAM and show that the fusion algorithm can significantly improve recognition accuracy.
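For reference, a minimal sketch of the LOSO protocol is given below; the feature matrix, labels, subject IDs, and the simple scikit-learn classifier are hypothetical placeholders that stand in for the actual model and data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.randn(60, 16)              # hypothetical sample features
y = np.random.randint(0, 4, size=60)     # 4 emotion classes
subjects = np.repeat(np.arange(10), 6)   # subject ID of each sample

accs = []
for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Train on all subjects except one; test on the held-out subject.
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    accs.append(clf.score(X[te], y[te]))
print(f"mean LOSO accuracy: {np.mean(accs):.4f}")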
Conclusion
STANet is proposed in this study for microexpression recognition. SAM emphasizes the salient regions of the microexpression by placing additional weights on these regions, while TAM learns large weights for the clips with high variation within a sequence. Experiments on the three public microexpression datasets show that STANet achieves the highest recognition accuracy on the CASME and CASME II datasets compared with eight other state-of-the-art methods and demonstrates satisfactory performance on the SMIC-HS dataset.
Keywords: microexpression recognition; classification; facial feature; deep learning; attention mechanism; spatiotemporal attention
Bahdanau D, Cho K and Bengio Y. 2014. Neural machine translation by jointly learning to align and translate[EB/OL].[2020-06-19]. https://arxiv.org/pdf/1409.0473.pdf
Davison A K, Yap M H and Lansley C. 2015. Micro-facial movement detection using individualised baselines and histogram-based descriptors//Proceedings of 2015 IEEE International Conference on Systems, Man, and Cybernetics. Kowloon, China: IEEE: 1864-1869[DOI: 10.1109/SMC.2015.326]
Happy S L and Routray A. 2019. Fuzzy histogram of optical flow orientations for micro-expression recognition. IEEE Transactions on Affective Computing, 10(3):394-406[DOI:10.1109/TAFFC.2017.2723386]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 7132-7141[DOI: 10.1109/CVPR.2018.00745]
Huang X H, Wang S J, Liu X, Zhao G Y, Feng X Y and Pietikäinen M. 2019. Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition. IEEE Transactions on Affective Computing, 10(1):32-47[DOI:10.1109/TAFFC.2017.2713359]
Huang X H, Wang S J, Zhao G Y and Pietikäinen M. 2015. Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop. Santiago: IEEE: 1-9[DOI: 10.1109/ICCVW.2015.10]
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 1647-1655[DOI: 10.1109/CVPR.2017.179]
Jaderberg M, Simonyan K, Zisserman A and Kavukcuoglu K. 2015. Spatial transformer networks//Proceedings of the 28th Advances in Neural Information Processing Systems. Red Hook: NIPS: 2017-2025
Khor H Q, See J, Phan R C W and Lin W Y. 2018. Enriched long-term recurrent convolutional network for facial micro-expression recognition//Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition. Xi'an: IEEE: 667-674[DOI: 10.1109/FG.2018.00105]
Kim D H, Baddar W J and Ro Y M. 2016. Micro-expression recognition with expression-state constrained spatio-temporal feature representations//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM: 382-386[DOI: 10.1145/2964284.2967247]
Li X B, Pfister T, Huang X H, Zhao G Y and Pietikäinen M. 2013. A spontaneous micro-expression database: inducement, collection and baseline//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai, China: IEEE: 1-6[DOI: 10.1109/FG.2013.6553717]
Liu J, Wang G, Hu P, Duan L Y and Kot A C. 2017. Global context-aware attention LSTM networks for 3D action recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 3671-3680[DOI: 10.1109/CVPR.2017.391]
Liu Y J, Li B J and Lai Y K. 2018. Sparse MDMO: learning a discriminative feature for spontaneous micro-expression recognition. IEEE Transactions on Affective Computing: 1-8[DOI: 10.1109/TAFFC.2018.2854166]
Liu Y J, Zhang J K, Yan W J, Wang S J, Zhao G Y and Fu X L. 2016. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Transactions on Affective Computing, 7(4):299-310[DOI:10.1109/TAFFC.2015.2485205]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE: 3431-3440[DOI: 10.1109/CVPR.2015.7298965]
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: ACM: 289-297
Mnih V, Heess N, Graves A and Kavukcuoglu K. 2014. Recurrent models of visual attention//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: ACM: 2204-2212
Newell A, Yang K Y and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 483-499[DOI: 10.1007/978-3-319-46484-8_29]
Patel D, Hong X P and Zhao G Y. 2016. Selective deep features for micro-expression recognition//Proceedings of the 23rd International Conference on Pattern Recognition. Cancun: IEEE: 2258-2263[DOI: 10.1109/ICPR.2016.7899972]
Peng M, Wang C Y, Chen T, Liu G Y and Fu X L. 2017. Dual temporal scale convolutional neural network for micro-expression recognition. Frontiers in Psychology, 8:1745[DOI:10.3389/FPSYG.2017.01745]
Peng Y X, Zhao Y Z and Zhang J C. 2019. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773-786[DOI:10.1109/TCSVT.2018.2808685]
Pfister T, Li X B, Zhao G Y and Pietikäinen M. 2011. Recognising spontaneous facial micro-expressions//Proceedings of 2011 International Conference on Computer Vision. Barcelona: IEEE: 1449-1456[DOI: 10.1109/ICCV.2011.6126401]
Wang F, Jiang M Q, Qian C, Yang S, Li C, Zhang H G, Wang X G and Tang X O. 2017. Residual attention network for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 6450-6458[DOI: 10.1109/CVPR.2017.683]
Wang S J, Yan W J, Li X B, Zhao G Y, Zhou C G, Fu X L, Yang M H and Tao J H. 2015. Micro-expression recognition using color spaces. IEEE Transactions on Image Processing, 24(12):6034-6047[DOI:10.1109/TIP.2015.2496314]
Wang Y D, See J, Phan R C W and Oh Y H. 2014. LBP with six intersection points: reducing redundant information in LBP-TOP for micro-expression recognition//Proceedings of the 12th Asian Conference on Computer Vision. Singapore: Springer: 525-537[DOI: 10.1007/978-3-319-16865-4_34]
Xu F, Zhang J P and Wang J Z. 2017. Microexpression identification and categorization using a facial dynamics map. IEEE Transactions on Affective Computing, 8(2):254-267[DOI:10.1109/TAFFC.2016.2518162]
Yan W J, Li X B, Wang S J, Zhao G Y, Liu Y J, Chen Y H and Fu X L. 2014. CASME II: an improved spontaneous micro-expression database and the baseline evaluation. PLoS One, 9(1):e86041[DOI:10.1371/journal.pone.0086041]
Yan W J, Wu Q, Liu Y J, Wang S J and Fu X L. 2013. CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai: IEEE: 1-7[DOI: 10.1109/FG.2013.6553799]
Zong Y, Huang X H, Zheng W M, Cui Z and Zhao G Y. 2018. Learning from hierarchical spatiotemporal descriptors for micro-expression recognition. IEEE Transactions on Multimedia, 20(11):3160-3172[DOI:10.1109/TMM.2018.2820321]