Spatiotemporal attention network for microexpression recognition
2020, Vol. 25, No. 11, Pages 2380-2390
Print publication date: 2020-11-16
Accepted: 2020-09-14
DOI: 10.11834/jig.200325
Guohao Li, Yifan Yuan, Xianye Ben, Junping Zhang. Spatiotemporal attention network for microexpression recognition[J]. Journal of Image and Graphics, 2020,25(11):2380-2390.
Objective
Microexpressions are spontaneous facial muscle movements that can reveal the genuine emotions a person tries to conceal, with potential applications in security, suspect interrogation, and psychological testing. To alleviate the low recognition accuracy caused by the small amplitude and short duration of microexpression muscle movements, this paper proposes a spatiotemporal attention network (STANet) for microexpression recognition.
Method
STANet contains a spatial attention module and a temporal attention module. The spatial attention module first concentrates the model's attention on the regions where the microexpression intensity is larger; the temporal attention module then assigns larger weights to the frames with greater microexpression variation, which are therefore more discriminative.
Result
On three public microexpression datasets (the Chinese Academy of Sciences microexpression (CASME) dataset, CASME II, and the spontaneous microexpression database-high speed camera (SMIC-HS) dataset), STANet was compared with eight other algorithms under leave-one-subject-out cross validation. The results show that on CASME, the classification accuracy of STANet is 1.78% higher than that of the second-best model, Sparse MDMO (sparse main directional mean optical flow); on CASME II, its classification accuracy is 1.90% higher than that of the second-best model, HIGO (histogram of image gradient orientation); and on SMIC-HS, its classification accuracy reaches 68.90%.
Conclusion
Considering the small muscle amplitude, small active regions, and short duration of microexpressions, this paper applies the attention mechanism to the microexpression recognition task and proposes the STANet model, which concentrates attention on the regions with larger microexpression amplitude and on the clips with greater variation between adjacent frames.
Objective
Microexpression, a kind of spontaneous facial muscle movement, can reveal the genuine underlying emotions that people attempt to conceal. Microexpression has potential applications in security, police interrogation, and psychological testing. Compared with macroexpressions, microexpressions have lower intensity and shorter duration, which makes them more difficult to recognize. Traditional methods can be divided into facial image-based and optical flow-based approaches. Facial image-based methods utilize spatiotemporal partition blocks to construct feature vectors, where the spatiotemporal segmentation parameters are regarded as hyperparameters and every sample in a dataset uses the same hyperparameters. Recognition performance may therefore suffer when the same spatiotemporal division blocks are used for different samples, which may require varying spatiotemporal segmentation blocks. Optical flow-based methods are widely used for microexpression recognition. Although such methods demonstrate satisfactory robustness to illumination variation, they treat facial features in different regions as equally important and thus ignore the fact that microexpressions appear only in partial regions. The attention mechanism, which has been introduced in many fields such as natural language processing and computer vision, can focus on the salient regions of an object and give additional weights to these regions. Given its outstanding performance in recognition tasks, we apply the attention mechanism to the microexpression recognition task and propose a spatiotemporal attention network (STANet).
Method
STANet mainly consists of two modules: a spatial attention module (SAM) and a temporal attention module (TAM). SAM is used to focus on the microexpression regions with high intensity, while TAM is incorporated to learn discriminative frames, which are given additional weights. Inspired by the fully convolutional network (FCN), which was proposed for semantic segmentation, we propose a spatial attention branch (SAB) in the SAM. SAB, a top-down and bottom-up structure, is the crucial component of SAM. In the downsampling process, convolutional layers and nonlinear transformations followed by maximum pooling are used to extract the salient features of the microexpression; the maximum pooling operation reduces the resolution and increases the receptive field of the feature map. In the upsampling process, we use bilinear interpolation to gradually recover the feature map to its original size and adopt skip connections to retain detailed information that may otherwise be lost. A sigmoid function is ultimately applied after the last layer to normalize the SAB output to [0, 1].
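The following is a minimal PyTorch sketch of such a top-down and bottom-up attention branch. The channel width, depth, pooling schedule, and the final reweighting of the input feature map are our assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionBranch(nn.Module):
    # Top-down: conv + max pooling shrink the map and enlarge the receptive
    # field; bottom-up: bilinear upsampling restores resolution, with skip
    # connections preserving detail; a sigmoid yields a mask in [0, 1].
    def __init__(self, channels=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.mask_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        f1 = self.down1(x)                    # full-resolution features
        f2 = self.down2(F.max_pool2d(f1, 2))  # 1/2 resolution
        f3 = F.max_pool2d(f2, 2)              # 1/4 resolution
        u2 = F.interpolate(f3, scale_factor=2, mode='bilinear',
                           align_corners=False) + f2   # skip connection
        u1 = F.interpolate(u2, scale_factor=2, mode='bilinear',
                           align_corners=False) + f1   # skip connection
        mask = torch.sigmoid(self.mask_conv(u1))       # attention mask in [0, 1]
        return x * mask                       # reweight salient facial regions

feats = torch.randn(4, 32, 48, 48)            # a batch of 4 feature maps
print(SpatialAttentionBranch(32)(feats).shape)  # torch.Size([4, 32, 48, 48])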
Furthermore, we propose a temporal attention branch (TAB) to focus on the more discriminative frames in the microexpression sequence, which are crucial for microexpression recognition.
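A frame-weighting branch of this kind could look like the sketch below, which scores per-frame feature vectors and pools them with softmax weights over time; the scoring network and the pooling step are assumptions for illustration, since the exact TAB design is not specified in this abstract.

import torch
import torch.nn as nn

class TemporalAttentionBranch(nn.Module):
    # Score each frame's feature vector, normalize the scores over time with
    # a softmax, and pool the sequence with the resulting frame weights.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 64),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(64, 1))

    def forward(self, seq):  # seq: (batch, time, feat_dim)
        w = torch.softmax(self.score(seq).squeeze(-1), dim=1)  # (batch, time)
        return (seq * w.unsqueeze(-1)).sum(dim=1)  # weighted pooling over time

pooled = TemporalAttentionBranch(128)(torch.randn(4, 20, 128))  # 20-frame clips
print(pooled.shape)  # torch.Size([4, 128])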
Experiments are conducted on the Chinese Academy of Sciences microexpression (CASME), the Chinese Academy of Sciences microexpression II (CASME II), and spontaneous microexpression database-high speed camera (SMIC-HS) datasets, with 171, 246, and 164 samples, respectively. Corner-crop and rescaling augmentations are used on CASME and CASME II to avoid overfitting, with scaling factors set to 0.9, 1.0, and 1.1. Corner-crop and horizontal-flip augmentations are applied to SMIC-HS. Because different samples contain different numbers of frames, linear interpolation is used to interpolate every sample to 20 frames. Samples are then resized to 192×192 pixels.
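As a concrete illustration of this preprocessing, the sketch below interpolates a clip to 20 frames along the time axis and resizes the frames to 192×192 pixels; implementing the temporal step with torch's 1-D linear interpolation is our choice of implementation, not necessarily the authors'.

import torch
import torch.nn.functional as F

def normalize_clip(clip, num_frames=20, size=192):
    # clip: (T, C, H, W) -> (num_frames, C, size, size)
    t, c, h, w = clip.shape
    # Temporal step: treat every pixel trajectory as a 1-D signal of length T
    # and linearly interpolate it to num_frames samples.
    x = clip.permute(1, 2, 3, 0).reshape(1, c * h * w, t)
    x = F.interpolate(x, size=num_frames, mode='linear', align_corners=True)
    clip = x.reshape(c, h, w, num_frames).permute(3, 0, 1, 2)
    # Spatial step: resize every frame to size x size pixels.
    return F.interpolate(clip, size=(size, size), mode='bilinear',
                         align_corners=False)

clip = torch.rand(13, 3, 240, 240)     # a 13-frame RGB sample
print(normalize_clip(clip).shape)      # torch.Size([20, 3, 192, 192])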
Finally, we use FlowNet 2.0 to obtain the optical flow sequence of each sample. The experiments use the Adam optimizer with a learning rate of 1E-5; the weight decay coefficient is set to 1E-4, and the coefficient $\lambda$ of the $\ell_1$ regularization term is set to 1E-8. The number of iterations is 60, 30, and 100 for CASME, CASME II, and SMIC-HS, respectively.
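The reported optimization settings can be set up roughly as follows; model and loader are hypothetical stand-ins, and adding the $\ell_1$ penalty directly to the loss is one common way to realize such a regularization term.

import torch
import torch.nn as nn

model = nn.Linear(128, 4)   # hypothetical stand-in for STANet (4 classes)
loader = [(torch.randn(8, 128), torch.randint(0, 4, (8,)))]  # stand-in data

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
lam = 1e-8                  # coefficient of the l1 regularization term

for inputs, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    # Add the l1 penalty on the parameters to the classification loss.
    loss = loss + lam * sum(p.abs().sum() for p in model.parameters())
    loss.backward()
    optimizer.step()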
Result
We compared our model with eight state-of-the-art frameworks, including facial image-based and optical flow-based methods, on three public microexpression datasets, namely, CASME, CASME II, and SMIC-HS. Leave-one-subject-out (LOSO) cross validation is used because of the insufficient number of samples, and classification accuracy is used to measure the performance of the methods. The results show that our model achieves the best performance on the CASME and CASME II datasets. The classification accuracy of our model on CASME is 1.78% higher than that of Sparse MDMO, which ranks second, and the classification accuracy of STANet on CASME II is 1.90% higher than that of the histogram of image gradient orientation (HIGO) method. On SMIC-HS, the classification accuracy of our model is 68.90%. Ablation studies performed on the CASME dataset verify the validity of SAM and TAM and show that the fusion algorithm can significantly improve recognition accuracy.
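For reference, a minimal sketch of the LOSO protocol is given below; the feature matrix, labels, subject IDs, and the simple scikit-learn classifier are hypothetical placeholders that stand in for the actual model and data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.randn(60, 16)              # hypothetical sample features
y = np.random.randint(0, 4, size=60)     # 4 emotion classes
subjects = np.repeat(np.arange(10), 6)   # subject ID of each sample

accs = []
for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Train on all subjects except one; test on the held-out subject.
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    accs.append(clf.score(X[te], y[te]))
print(f"mean LOSO accuracy: {np.mean(accs):.4f}")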
Conclusion
STANet is proposed in this study for microexpression recognition. SAM emphasizes the salient regions of the microexpression by placing additional weights on these regions, while TAM learns large weights for the clips with high variation within a sequence. Experiments on the three public microexpression datasets show that STANet achieves the highest recognition accuracy on the CASME and CASME II datasets compared with eight other state-of-the-art methods and demonstrates satisfactory performance on the SMIC-HS dataset.
Keywords: microexpression recognition; classification; facial feature; deep learning; attention mechanism; spatiotemporal attention
Bahdanau D, Cho K and Bengio Y. 2014. Neural machine translation by jointly learning to align and translate[EB/OL].[2020-06-19]. https://arxiv.org/pdf/1409.0473.pdf
Davison A K, Yap M H and Lansley C. 2015. Micro-facial movement detection using individualised baselines and histogram-based descriptors//Proceedings of 2015 IEEE International Conference on Systems, Man, and Cybernetics. Kowloon, China: IEEE: 1864-1869[DOI: 10.1109/SMC.2015.326]
Happy S L and Routray A. 2019. Fuzzy histogram of optical flow orientations for micro-expression recognition. IEEE Transactions on Affective Computing, 10(3):394-406[DOI:10.1109/TAFFC.2017.2723386]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 7132-7141[DOI: 10.1109/CVPR.2018.00745]
Huang X H, Wang S J, Liu X, Zhao G Y, Feng X Y and Pietikäinen M. 2019. Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition. IEEE Transactions on Affective Computing, 10(1):32-47[DOI:10.1109/TAFFC.2017.2713359]
Huang X H, Wang S J, Zhao G Y and Pietikäinen M. 2015. Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop. Santiago: IEEE: 1-9[DOI: 10.1109/ICCVW.2015.10]
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 1647-1655[DOI: 10.1109/CVPR.2017.179]
Jaderberg M, Simonyan K, Zisserman A and Kavukcuoglu K. 2015. Spatial transformer networks//Proceedings of the 28th Advances in Neural Information Processing Systems. Red Hook: NIPS: 2017-2025
Khor H Q, See J, Phan R C W and Lin W Y. 2018. Enriched long-term recurrent convolutional network for facial micro-expression recognition//Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition. Xi'an: IEEE: 667-674[DOI: 10.1109/FG.2018.00105]
Kim D H, Baddar W J and Ro Y M. 2016. Micro-expression recognition with expression-state constrained spatio-temporal feature representations//Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM: 382-386[DOI: 10.1145/2964284.2967247]
Li X B, Pfister T, Huang X H, Zhao G Y and Pietikäinen M. 2013. A spontaneous micro-expression database: inducement, collection and baseline//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai, China: IEEE: 1-6[DOI: 10.1109/FG.2013.6553717]
Liu J, Wang G, Hu P, Duan L Y and Kot A C. 2017. Global context-aware attention LSTM networks for 3D action recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 3671-3680[DOI: 10.1109/CVPR.2017.391]
Liu Y J, Li B J and Lai Y K. 2018. Sparse MDMO: learning a discriminative feature for spontaneous micro-expression recognition. IEEE Transactions on Affective Computing: 1-8[DOI: 10.1109/TAFFC.2018.2854166]
Liu Y J, Zhang J K, Yan W J, Wang S J, Zhao G Y and Fu X L. 2016. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Transactions on Affective Computing, 7(4):299-310[DOI:10.1109/TAFFC.2015.2485205]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE: 3431-3440[DOI: 10.1109/CVPR.2015.7298965]
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: ACM: 289-297
Mnih V, Heess N, Graves A and Kavukcuoglu K. 2014. Recurrent models of visual attention//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: ACM: 2204-2212
Newell A, Yang K Y and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 483-499[DOI: 10.1007/978-3-319-46484-8_29]
Patel D, Hong X P and Zhao G Y. 2016. Selective deep features for micro-expression recognition//Proceedings of the 23rd International Conference on Pattern Recognition. Cancun: IEEE: 2258-2263[DOI: 10.1109/ICPR.2016.7899972]
Peng M, Wang C Y, Chen T, Liu G Y and Fu X L. 2017. Dual temporal scale convolutional neural network for micro-expression recognition. Frontiers in Psychology, 8:1745[DOI:10.3389/FPSYG.2017.01745]
Peng Y X, Zhao Y Z and Zhang J C. 2019. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773-786[DOI:10.1109/TCSVT.2018.2808685]
Pfister T, Li X B, Zhao G Y and Pietikäinen M. 2011. Recognising spontaneous facial micro-expressions//Proceedings of 2011 International Conference on Computer Vision. Barcelona: IEEE: 1449-1456[DOI: 10.1109/ICCV.2011.6126401]
Wang F, Jiang M Q, Qian C, Yang S, Li C, Zhang H G, Wang X G and Tang X O. 2017. Residual attention network for image classification//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 6450-6458[DOI: 10.1109/CVPR.2017.683]
Wang S J, Yan W J, Li X B, Zhao G Y, Zhou C G, Fu X L, Yang M H and Tao J H. 2015. Micro-expression recognition using color spaces. IEEE Transactions on Image Processing, 24(12):6034-6047[DOI:10.1109/TIP.2015.2496314]
Wang Y D, See J, Phan R C W and Oh Y H. 2014. LBP with six intersection points: reducing redundant information in LBP-TOP for micro-expression recognition//Proceedings of the 12th Asian Conference on Computer Vision. Singapore: Springer: 525-537[DOI: 10.1007/978-3-319-16865-4_34]
Xu F, Zhang J P and Wang J Z. 2017. Microexpression identification and categorization using a facial dynamics map. IEEE Transactions on Affective Computing, 8(2):254-267[DOI:10.1109/TAFFC.2016.2518162]
Yan W J, Li X B, Wang S J, Zhao G Y, Liu Y J, Chen Y H and Fu X L. 2014. CASME II: an improved spontaneous micro-expression database and the baseline evaluation. PLoS One, 9(1):e86041[DOI:10.1371/journal.pone.0086041]
Yan W J, Wu Q, Liu Y J, Wang S J and Fu X L. 2013. CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai: IEEE: 1-7[DOI: 10.1109/FG.2013.6553799]
Zong Y, Huang X H, Zheng W M, Cui Z and Zhao G Y. 2018. Learning from hierarchical spatiotemporal descriptors for micro-expression recognition. IEEE Transactions on Multimedia, 20(11):3160-3172[DOI:10.1109/TMM.2018.2820321]