Spatial Attention-based Multi-layer Fusion Method for High-Quality Class Activation Maps
2024, Pages: 1-15
Online publication date: 2024-12-23
DOI: 10.11834/jig.240216
Zhang Jian, Zhang Yiran, Wang Zicong. Spatial Attention-based Multi-layer Fusion Method For High-Quality Class Activation Maps[J]. Journal of Image and Graphics, 2024: 1-15.
Objective
The wide use of deep convolutional neural networks in vision tasks has drawn attention to their decision-making mechanisms, since their complexity and opacity make them black-box models. Class activation maps have been shown to effectively improve the interpretability of image classification and thus deepen the understanding of these decision mechanisms, but when highlighting target regions, existing methods often suffer from blurred boundaries, overly large highlighted extents, and insufficient fine granularity. To overcome these limitations, a spatial attention-based multi-layer fusion method for high-quality class activation maps (SAMLCAM) is proposed.
Method
Previous class activation map methods attend only to channel-level weights and ignore spatial position information, which degrades object localization. The proposed SAMLCAM therefore introduces a hybrid attention mechanism that combines a channel attention mechanism with a spatial attention mechanism, strengthening object localization while suppressing uninformative positions. Building on this localization result, and exploiting the characteristics of the network's multiple convolutional layers, a multi-layer weighted fusion mechanism is proposed that improves how multi-layer feature maps are fused, mitigating the overly large boundary extent and insufficient granularity of class activation maps and thereby enhancing their visual interpretability.
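As a rough illustration of this idea (a sketch only, not SAMLCAM's exact formulation: the pooling operations, the sigmoid squashing, and the way the spatial mask modulates the channel-weighted sum are assumptions made here for concreteness), a hybrid of Grad-CAM-style channel weights and a parameter-free spatial mask can be written as:

```python
# Sketch: channel weights (as in Grad-CAM) gated by a spatial attention mask.
# The CBAM-style avg/max pooling and the sigmoid are illustrative choices;
# SAMLCAM's actual attention design may differ.
import torch
import torch.nn.functional as F

def hybrid_attention_cam(A: torch.Tensor, dA: torch.Tensor) -> torch.Tensor:
    """A: feature maps (1, C, h, w); dA: gradients of the class score w.r.t. A."""
    # Channel weights: global-average-pooled gradients, one scalar per channel.
    w = dA.mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)
    # Spatial mask: pool features across channels, squash to (0, 1).
    avg_pool = A.mean(dim=1, keepdim=True)           # (1, 1, h, w)
    max_pool = A.amax(dim=1, keepdim=True)           # (1, 1, h, w)
    s = torch.sigmoid(avg_pool + max_pool)           # spatial attention mask
    # Channel-weighted sum of feature maps, gated by the spatial mask.
    return F.relu((w * A).sum(dim=1, keepdim=True) * s)
```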
Result
The proposed method is evaluated on the ILSVRC 2012 and MS COCO 2017 datasets, both widely used benchmarks for computer vision models, over several convolutional network models to be explained, including ablation studies, qualitative evaluation, and quantitative evaluation. The ablation studies verify the effectiveness of each module; the qualitative evaluation gives a direct visual demonstration of the improved interpretability; and the quantitative results show that SAMLCAM outperforms the weakest baseline by more than 7% on both the Loc1 and Loc5 metrics and by more than 9.85% on the energy-based localization metric. Because the method suppresses the contextual background around the target region, it has a slight negative effect on the confidence score; even so, on the confidence metrics it remains within 2% of the other methods and maintains high performance.
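For reference, Loc1 and Loc5 are the standard top-1 and top-5 localization accuracies, and the energy-based localization metric follows the energy pointing game introduced with Score-CAM: the fraction of total activation-map energy that falls inside the ground-truth bounding box. A minimal sketch of that metric, assuming a CAM already normalized and resized to the input resolution:

```python
import torch

def energy_in_bbox(cam: torch.Tensor, bbox) -> float:
    """cam: (H, W) activation map; bbox: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = bbox
    inside = cam[y1:y2, x1:x2].sum()             # energy inside the box
    return (inside / (cam.sum() + 1e-8)).item()  # fraction of total energy
```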
Objective
The success of Deep Convolutional Neural Networks (DCNNs) in image classification, object detection, and semantic segmentation has revolutionized the field of artificial intelligence. These models have demonstrated exceptional accuracy and have been deployed in various real-world applications. However, a major drawback of DCNNs is their lack of interpretability, often referred to as the "black-box" problem. When a DCNN makes a prediction, it is challenging to understand how and why it arrived at that decision. This lack of transparency hinders our ability to trust and rely on the model's outputs, especially in critical domains such as healthcare, autonomous driving, and finance. For instance, in medical diagnosis, it is crucial for healthcare professionals to comprehend the reasoning behind a model's diagnosis to make informed decisions about patient care. Explainable artificial intelligence (XAI) aims to address this issue by providing human-interpretable explanations for the decisions made by complex machine learning models. XAI seeks to bridge the gap between model performance and model interpretability, allowing users to understand the inner workings of the model and have confidence in its outputs. Researchers have been actively developing techniques and methods to enhance the interpretability of deep learning models. One approach is to generate visual explanations through techniques like CAM, Grad-CAM, and Smooth Grad-CAM++. These methods provide heatmaps or attention maps that highlight the areas of an input image that influenced the model's decision the most. By visualizing this information, users can gain insights into the features and patterns the model focuses on when making predictions. Experimental evidence has shown that class activation map methods can effectively enhance the interpretability of image classification. However, existing methods only provide rough range explanations and suffer from excessively large boundary effects and insufficient granularity. To tackle these problems, spatial attention-based multi-layer fusion for high-quality class activation maps (SAMLCAM) is proposed. It combines channel attention mechanisms and spatial attention mechanisms based on Grad-CAM. SAMLCAM achieves more effective object localization and enhances visual interpretability by addressing the issues of excessively large activation map boundaries and lack of fine granularity through multi-layer fusion.
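For context, Grad-CAM weights each channel of the last convolutional feature maps by the global-average-pooled gradient of the class score, sums the weighted maps, and applies ReLU. Below is a minimal PyTorch sketch, assuming a torchvision ResNet-50 and a preprocessed input tensor of shape (1, 3, 224, 224); the hook-based plumbing is illustrative, not this paper's code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}
layer = model.layer4  # last convolutional stage
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(x, class_idx=None):
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()  # explain the predicted class
    model.zero_grad()
    logits[0, class_idx].backward()                 # gradients of the class score
    A, dA = feats["a"], grads["a"]                  # both (1, C, h, w)
    w = dA.mean(dim=(2, 3), keepdim=True)           # channel weights via GAP
    cam = F.relu((w * A).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam[0, 0], class_idx
```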
Method
In current class activation map methods, only the channel weights are considered, while the beneficial spatial position information, which contributes to target localization, is overlooked. In our paper, a hybrid attention mechanism combining channel attention and spatial attention is proposed to enhance the interpretability of target localization. The spatial attention mechanism focuses on the spatial relationships among different regions of the feature maps. By assigning higher weights to regions that are more likely to contain the target object, SAMLCAM can enhance the precision of object localization while reducing false positives. This attention mechanism allows the model to allocate more attention to discriminative features, leading to improved object localization. One key improvement of SAMLCAM lies in its multi-layer attention mechanism. Previous methods often suffer from boundary effects, where the activation maps tend to have excessively large boundaries that may include irrelevant regions. SAMLCAM addresses this issue by refining the attention maps at multiple layers of the network. It not only relies on the results of the final convolutional layer but also takes into account multiple aspects, including attention to shallow layers. This enriches the reference information, resulting in a more comprehensive understanding of the semantic information of the target object while reducing unnecessary background information. This multi-layer attention mechanism helps to gradually refine the boundaries and improve localization accuracy by reducing the influence of irrelevant regions. Moreover, SAMLCAM tackles the problem of insufficient granularity in class activation maps. In some cases, the activation maps generated by previous methods lack fine detail, making it challenging to precisely identify the object of interest. SAMLCAM overcomes this limitation by leveraging the multi-layer attention mechanism to capture more detailed information in the activation maps, yielding high-quality class activation maps with enhanced visual interpretability (see the fusion sketch after this paragraph). The ILSVRC 2012 dataset is a large-scale image classification dataset consisting of over a million labeled images from 1,000 categories, widely used for benchmarking computer vision models. The evaluation results on the ILSVRC 2012 validation dataset demonstrate the effectiveness of SAMLCAM in improving object localization metrics and energy localization decision metrics. The proposed method contributes to the field by offering a more comprehensive understanding of how deep models make decisions in visual tasks and provides insights into improving their interpretability. The proposed SAMLCAM method is evaluated on five backbone convolutional network models using the ILSVRC 2012 validation dataset and compared with five state-of-the-art saliency methods, namely Grad-CAM, Grad-CAM++, XGradCAM, Score-CAM, and LayerCAM. The results demonstrate the performance improvement of SAMLCAM over the lowest-performing methods in both the Loc1 and Loc5 metrics, with an increase of over 7%. Additionally, in the comparison of energy localization decision metrics, SAMLCAM shows an improvement of more than 9.85% over the lowest-performing methods. It should be noted that the improved method reduces the contextual background areas surrounding the target sample region, which negatively affects the confidence metric.
However, in terms of the credibility metric, SAMLCAM maintains a gap of no more than 2% compared with other methods. In addition, a series of comparative experiments visually demonstrates the effectiveness of the fusion algorithm.
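To make the multi-layer fusion step concrete, the sketch below upsamples CAMs from several stages to a common size, normalizes each, and combines them with per-layer weights. The uniform fallback weights (`alphas`) are placeholders; the paper's actual weighting scheme is not reproduced here.

```python
import torch
import torch.nn.functional as F

def fuse_cams(cams, size, alphas=None):
    """cams: list of (1, 1, h_i, w_i) maps, shallow to deep; size: (H, W)."""
    if alphas is None:
        alphas = [1.0 / len(cams)] * len(cams)      # uniform placeholder weights
    fused = torch.zeros(1, 1, *size)
    for cam, a in zip(cams, alphas):
        cam = F.interpolate(cam, size=size, mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # per-layer norm
        fused = fused + a * cam                     # weighted accumulation
    return fused / (fused.max() + 1e-8)             # rescale fused map
```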
Conclusion
In conclusion, the SAMLCAM method presents a novel approach to enhancing the interpretability of deep convolutional neural network models. By incorporating channel attention and spatial attention mechanisms, it improves object localization and overcomes the limitations of previous methods, such as excessive boundary effects and the lack of fine granularity in class activation maps. The evaluation results on the ILSVRC 2012 dataset highlight the performance improvement of SAMLCAM over other methods in terms of localization metrics and energy localization decision metrics. The proposed method contributes to advancing the field of visual deep learning and offers valuable insights into understanding and improving the interpretability of black-box models.
class activation map; explainable artificial intelligence; attention mechanism; feature attribution; image classification; convolutional neural network
Ali M N Y, Rahman M L, Chaki J, Dey N and Santosh K C. 2021. Machine translation using deep learning for universal networking language based on their structure. International Journal of Machine Learning and Cybernetics, 12(8): 2365-2376. [DOI: 10.1007/s13042-021-01317-5]
Arrieta A B, Díaz-Rodríguez N, Del Ser J, Bennetot A and Herrera F. 2020. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58: 82-115. [DOI: 10.48550/arXiv.1910.10045]
Chattopadhay A, Sarkar A, Howlader P and Balasubramanian V N. 2018. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). USA: IEEE: 839-847. [DOI: 10.48550/arXiv.1710.11063]
Chen Z and Sun Q. 2023. Extracting class activation maps from non-discriminative features as well//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. USA: IEEE: 3135-3144.
Dong Y, Su H, Wu B, Li Z, Liu W, Zhang T and Zhu J. 2019. Efficient decision-based black-box adversarial attacks on face recognition//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. USA: IEEE: 7714-7722. [DOI: 10.1109/CVPR.2019.00790]
Dong Y, Su H and Zhu J. 2022. Interpretability analysis of deep neural networks with adversarial examples. Acta Automatica Sinica, 48(1): 75-86. [DOI: 10.16383/j.aas.c200317]
Dou H, Zhang L M, Han F, Shen F R and Zhao J. 2023. Survey on convolutional neural network interpretability. Journal of Software, 35(1): 159-184. [DOI: 10.13328/j.cnki.jos.006758]
Englebert A, Cornu O and Vleeschouwer C D. 2024. Poly-CAM: high resolution class activation map for convolutional neural networks. Machine Vision and Applications, 35(4): 89. [DOI: 10.1007/s00138-024-01567-7]
Fong R C and Vedaldi A. 2017. Interpretable explanations of black boxes by meaningful perturbation//Proceedings of the IEEE International Conference on Computer Vision. USA: IEEE: 3429-3437. [DOI: 10.1109/ICCV.2017.371]
Fu R, Hu Q, Dong X, Guo Y and Li B. 2020. Axiom-based Grad-CAM: towards accurate visualization and explanation of CNNs//British Machine Vision Conference (BMVC 2020), oral. [DOI: 10.48550/arXiv.2008.02312]
Guidotti R, Monreale A, Turini F and Giannotti F. 2018. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5): 1-42. [DOI: 10.1145/3236009]
He K, Zhang X, Ren S and Sun J. 2016. Deep residual learning for image recognition//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE: 770-778. [DOI: 10.1109/CVPR.2016.90]
He W and Pan C. 2022. The salient object detection based on attention-guided network. Journal of Image and Graphics, 27(4): 1176-1190. [DOI: 10.11834/jig.200658]
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE: 4700-4708. [DOI: 10.1109/CVPR.2017.243]
Hua Y, Zhang D and Ge S. 2020. Research progress in the interpretability of deep learning models. Journal of Cyber Security, 5(3): 1-12. [DOI: 10.19363/J.cnki.cn10-1380/tn.2020.05.01]
Hu J, Shen L, Albanie S, Sun G and Wu E. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8): 2011-2023. [DOI: 10.1109/TPAMI.2019.2913372]
Jiang P T, Zhang C B, Hou Q, Cheng M and Wei Y. 2021. LayerCAM: exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing, 30: 5875-5888. [DOI: 10.1109/TIP.2021.3089943]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25. [DOI: 10.1145/3065386]
Linardatos P, Papastefanopoulos V and Kotsiantis S. 2020. Explainable AI: a review of machine learning interpretability methods. Entropy, 23(1): 18. [DOI: 10.48550/arXiv.1905.11474]
Lin T Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C. 2014. Microsoft COCO: common objects in context//Computer Vision - ECCV 2014. Switzerland: Springer International Publishing: 740-755. [DOI: 10.1007/978-3-319-10602-1_48]
Li X, Xiong H, Li X, Wu X, Zhang X, Liu J, Bian J and Dou D. 2022. Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond. Knowledge and Information Systems, 64(12): 3197-3234. [DOI: 10.48550/arXiv.2103.10689]
Omeiza D, Speakman S, Cintas C and Weldermariam K. 2019. Smooth Grad-CAM++: an enhanced inference level visualization technique for deep convolutional neural network models [EB/OL]. [2024-02-22]. [DOI: 10.48550/arXiv.1908.01224]
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D and Batra D. 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization//Proceedings of the IEEE International Conference on Computer Vision. USA: IEEE: 618-626. [DOI: 10.1007/s11263-019-01228-7]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [EB/OL]. [2018-10-14]. https://arxiv.org/pdf/1409.1556
Wang H, Wang Z, Du M, Yang F, Zhang Z, Ding S, Mardziel P and Hu X. 2020. Score-CAM: score-weighted visual explanations for convolutional neural networks//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. USA: IEEE: 24-25. [DOI: 10.1109/CVPRW50498.2020.00020]
Zhang Y, Tiňo P, Leonardis A and Tang K. 2021. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5): 726-742. [DOI: 10.1109/TETCI.2021.3100641]
Zheng Q, Wang Z, Zhou J and Lu J. 2022. Shap-CAM: visual explanations for convolutional neural networks based on Shapley value//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 459-474. [DOI: 10.1007/978-3-031-19775-8_27]
Zhou B, Khosla A, Lapedriza A, Oliva A and Torralba A. 2016. Learning deep features for discriminative localization//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE: 2921-2929.
Cyberspace Administration of China. 2021. Guiding Opinions on Strengthening the Comprehensive Governance of Internet Information Service Algorithms [EB/OL]. [2024-02-22]. https://www.gov.cn/zhengce/zhengceku/2021-09/30/content_5640398.htm