多层级特征融合与双教师协作的知识蒸馏
Knowledge distillation of multi-level feature fusion and dual-teacher collaboration
2024年29卷第12期 页码:3770-3785
纸质出版日期: 2024-12-16
DOI: 10.11834/jig.240104
王硕, 余璐, 徐常胜. 2024. 多层级特征融合与双教师协作的知识蒸馏. 中国图象图形学报, 29(12):3770-3785
Wang Shuo, Yu Lu, Xu Changsheng. 2024. Knowledge distillation of multi-level feature fusion and dual-teacher collaboration. Journal of Image and Graphics, 29(12):3770-3785
目的
知识蒸馏旨在不影响原始模型性能的前提下,将一个性能强大且参数量也较大的教师模型的知识迁移到一个轻量级的学生模型上。在图像分类领域,以往的蒸馏方法大多聚焦于全局信息的提取而忽略了局部信息的重要性。并且这些方法多是围绕单教师架构蒸馏,忽视了学生可以同时向多名教师学习的潜力。因此,提出了一种融合全局和局部特征的双教师协作知识蒸馏框架。
方法
首先随机初始化一个教师(临时教师)与学生处理全局信息进行同步训练,利用其临时的全局输出逐步帮助学生以最优路径接近教师的最终预测。同时又引入了一个预训练的教师(专家教师)处理局部信息。专家教师将局部特征输出分离为源类别知识和其他类别知识并分别转移给学生以提供较为全面的监督信息。
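作为说明,下面给出一个基于 torchvision 的最小示例(非原文实现,CIFAR-100 的均值、方差等均为假设的常用取值),演示如何在保留全局视图的同时,按 40%~70% 的面积比例随机裁剪得到局部视图:

```python
from torchvision import transforms

# CIFAR-100 常用的归一化参数(假设值,与原文设置未必一致)
MEAN = (0.5071, 0.4865, 0.4409)
STD = (0.2673, 0.2564, 0.2762)

# 全局视图:保持原图内容,只做尺寸统一与归一化
global_view = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

# 局部视图:RandomResizedCrop 的 scale 参数把裁剪面积限定为原图的 40%~70%
local_view = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.4, 0.7)),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

# 用法:对同一张 PIL 图像分别生成全局视图与局部视图
# x_global, x_local = global_view(img), local_view(img)
```

生成的全局视图供临时教师与学生同步训练使用,局部视图则用于专家教师一侧的蒸馏,与上文描述的双教师框架相对应。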
结果
在CIFAR-100(Canadian Institute for Advanced Research)和Tiny-ImageNet数据集上进行实验并与其他蒸馏方法进行了比较。在CIFAR-100数据集中,与最近的NKD(normalized knowledge distillation)相比,在师生相同架构与不同架构下,平均分类准确率分别提高了0.63%和1.00%。在Tiny-ImageNet数据集中,ResNet34(residual network)和MobileNetV1的师生组合下,分类准确率相较于SRRL(knowledge distillation via softmax regression representation learning)提高了1.09%,相较于NKD提高了1.06%。同时也在CIFAR-100数据集中进行了消融实验和可视化分析以验证所提方法的有效性。
结论
本文所提出的双教师协作知识蒸馏框架,融合了全局和局部特征,并将模型的输出响应分离为源类别知识和其他类别知识并分别转移给学生,使得学生模型的图像分类结果具有更高的准确率。
Objective
Knowledge distillation aims to transfer the knowledge of a teacher model, which has powerful performance but a large number of parameters, to a lightweight student model and to improve the student's performance without affecting that of the original model. Previous research on knowledge distillation has mostly focused on distillation from a single teacher to a single student and has neglected the potential for the student to learn from multiple teachers simultaneously. Multi-teacher distillation helps the student model synthesize the knowledge of every teacher model, thereby improving its expressive ability. Only a few studies have examined teacher distillation across these different situations, and learning from multiple teachers at the same time can integrate additional useful knowledge and thereby improve student performance. In addition, most existing knowledge distillation methods focus only on the global information of the image and ignore the importance of spatially local information. In image classification, local information refers to the features and details of specific regions in the image, including textures, shapes, and boundaries, which play an important role in distinguishing image categories. The teacher network can distinguish local regions based on these details and make accurate predictions for similar appearances belonging to different categories, whereas the student network may fail to do so. To address these issues, this article proposes a knowledge distillation method based on global and local dual-teacher collaboration, which integrates global and local information and effectively improves the classification accuracy of the student model.
Method
The original input image is first represented as global and local image views. The original image (global image view) is randomly cropped, with the ratio of the cropped area to the original image restricted to 40%-70%, to obtain local input information (local image view). Afterward, a teacher (scratch teacher) is randomly initialized and trained synchronously with the student to process global information, and its evolving global output is used to gradually help the student approach the teacher's final prediction along an optimal path. Meanwhile, a pre-trained teacher (expert teacher) is introduced to process local information. The proposed method thus uses a dual-teacher distillation architecture to jointly train the student network while integrating global and local features. On the one hand, the scratch teacher is trained from scratch together with the student and processes global information. With the scratch teacher, the student no longer imitates only the final smooth output of a pre-trained model (the expert teacher); instead, the scratch teacher's temporary outputs gradually guide the student toward the final output logits with higher accuracy along an optimal path. During training, the student model obtains not only the difference between the target and the scratch output but also a feasible path toward the final goal provided by a complex model with strong learning ability. On the other hand, the expert teacher processes local information and separates its local feature output into source category knowledge and other category knowledge, which are transferred to the student separately to provide comprehensive supervision. Through this collaborative teaching, the performance of the student model approaches that of the teacher model.
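As an illustration only (not the authors' released code), the following PyTorch-style sketch shows one way the dual-teacher objective described above could be assembled: the jointly trained scratch teacher supplies an evolving global-view logit target, while the frozen expert teacher's local-view output is split into source-class and other-class parts before being distilled. The function names, the temperature T, the loss weights, and the exact decoupling of the two knowledge parts are all assumptions.

```python
import torch
import torch.nn.functional as F

def kd_kl(student_logits, teacher_logits, T=4.0):
    """Standard softened KL-divergence distillation loss."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def split_source_other(logits, labels, T=4.0):
    """Split softened probabilities into the source (ground-truth) class and the remaining classes."""
    probs = F.softmax(logits / T, dim=1)
    mask = F.one_hot(labels, probs.size(1)).bool()
    p_source = probs[mask]                             # (B,) probability of the source class
    p_other = probs[~mask].view(probs.size(0), -1)     # (B, C-1) other classes
    p_other = p_other / p_other.sum(dim=1, keepdim=True)  # renormalize over the other classes
    return p_source, p_other

def dual_teacher_loss(s_global, s_local, t_scratch_global, t_expert_local, labels,
                      T=4.0, alpha=1.0, beta=1.0, gamma=1.0):
    # Hard-label supervision on the global view keeps the student anchored to the ground truth.
    ce = F.cross_entropy(s_global, labels)

    # Scratch teacher: its temporary global-view predictions guide the student step by step.
    loss_scratch = kd_kl(s_global, t_scratch_global.detach(), T)

    # Expert teacher: local-view knowledge is separated into source-class and other-class parts.
    s_src, s_oth = split_source_other(s_local, labels, T)
    t_src, t_oth = split_source_other(t_expert_local.detach(), labels, T)
    loss_source = F.binary_cross_entropy(s_src, t_src)                 # source-class knowledge
    loss_other = F.kl_div(s_oth.log(), t_oth, reduction="batchmean")   # other-class knowledge

    return ce + alpha * loss_scratch + beta * loss_source + gamma * loss_other
```

In a full training step, s_global and s_local would be the student's logits on the global and local views, t_scratch_global the output of the scratch teacher (updated in the same step with its own cross-entropy loss), and t_expert_local the output of the frozen expert teacher.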
Result
The proposed method is compared with other knowledge distillation methods used in image classification. The experimental datasets are CIFAR-100 and Tiny-ImageNet, and image classification accuracy is used as the evaluation metric. On the CIFAR-100 dataset, compared with the best-performing feature distillation method SemCKD, the average accuracy of the proposed method increases by 0.62% when teachers and students share the same architecture and by 0.89% when their architectures differ. Compared with the state-of-the-art response-based distillation method NKD, the average classification accuracy of the proposed method increases by 0.63% and 1.00% for homogeneous and heterogeneous teacher-student pairs, respectively. On the Tiny-ImageNet dataset, with ResNet34 as the teacher network and ResNet18 as the student network, the final test accuracy of the proposed method reaches the best value of 68.86%, which is 0.74% higher than that of NKD and the other competing models. The method also achieves the highest classification accuracy under different teacher-student architecture combinations. Ablation experiments and visual analysis on CIFAR-100 further demonstrate the effectiveness of the proposed method.
Conclusion
A dual-teacher collaborative knowledge distillation framework that integrates global and local information is proposed in this paper. The method separates the teacher's output responses into source category knowledge and other category knowledge and transfers them to the student separately. Experimental results show that the proposed method outperforms several state-of-the-art knowledge distillation methods in image classification and significantly improves the performance of the student model.
知识蒸馏(KD); 图像分类; 轻量级模型; 协作蒸馏; 特征融合
knowledge distillation (KD); image classification; lightweight model; collaborative distillation; feature fusion
Ahn S, Hu S X, Damianou A, Lawrence N D and Dai Z W. 2019. Variational information distillation for knowledge transfer//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 9155-9163 [DOI: 10.1109/CVPR.2019.00938]
Chen D, Mei J P, Zhang Y, Wang C, Wang Z, Feng Y and Chen C. 2021a. Cross-layer distillation with semantic calibration//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Virtually: AAAI: 7028-7036 [DOI: 10.1609/aaai.v35i8.16865]
Chen P G, Liu S, Zhao H S and Jia J Y. 2021b. Distilling knowledge via knowledge review//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 5008-5017 [DOI: 10.1109/CVPR46437.2021.00497]
Chen X J, Su J B and Zhang J. 2019. A two-teacher framework for knowledge distillation//16th International Symposium on Neural Networks. Moscow, Russia: Springer: 58-66 [DOI: 10.1007/978-3-030-22796-8_7]
Chen Z L, Zheng X X, Shen H L, Zeng Z Y, Zhou Y K and Zhao R C. 2020. Improving knowledge distillation via category structure//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 205-219 [DOI: 10.1007/978-3-030-58604-1_13]
Feng Z X, Lai J H and Xie X H. 2021. Resolution-aware knowledge distillation for efficient inference. IEEE Transactions on Image Processing, 30: 6985-6996 [DOI: 10.1109/tip.2021.3101158]
Fukuda T, Suzuki M, Kurata G, Thomas S, Cui J and Ramabhadran B. 2017. Efficient knowledge distillation from an ensemble of teachers//Interspeech 2017. Stockholm, Sweden: ISCA: 3697-3701 [DOI: 10.21437/interspeech.2017-614]
Gao H, Tian Y L, Xu F Y and Zhong S. 2021. Survey of deep learning model compression and acceleration. Journal of Software, 32(1): 68-92
高晗, 田育龙, 许封元, 仲盛. 2021. 深度学习模型压缩与加速综述. 软件学报, 32(1): 68-92 [DOI: 10.13328/j.cnki.jos.006096]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/cvpr.2016.90]
Heo B, Kim J, Yun S, Park H, Kwak N and Choi J Y. 2019. A comprehensive overhaul of feature distillation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 1921-1930 [DOI: 10.1109/ICCV.2019.00201]
Hinton G, Vinyals O and Dean J. 2015. Distilling the knowledge in a neural network [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1503.02531.pdf
Huang T, You S, Wang F, Qian C and Xu C. 2022. Knowledge distillation from a stronger teacher [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/2205.10536.pdf
Krizhevsky A. 2009. Learning Multiple Layers of Features from Tiny Images. Toronto, Canada: University of Toronto
Li X J, Wu J L, Fang H Y, Liao Y, Wang F and Qian C. 2020. Local correlation consistency for knowledge distillation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 18-33 [DOI: 10.1007/978-3-030-58610-2_2]
Liu X Y, Leonardi A, Yu L, Gilmer-Hill C, Leavitt M and Frankle J. 2023. Knowledge distillation for efficient sequences of training runs [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/2303.06480.pdf
Liu Y A, Zhang W and Wang J. 2020. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 415: 106-113 [DOI: 10.1016/j.neucom.2020.07.048]
Luo S H, Wang X C, Fang G F, Hu Y, Tao D P and Song M L. 2019. Knowledge amalgamation from heterogeneous networks by common feature learning [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1906.10546.pdf
Ma N N, Zhang X Y, Zheng H T and Sun J. 2018. ShuffleNet V2: practical guidelines for efficient CNN architecture design//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 122-138 [DOI: 10.1007/978-3-030-01264-9_8]
Mao Y H, Cao J, He P C, Liu X and Chai B. 2023. A survey of pruning methods based on deep neural networks. Microelectronics and Computer, 40(10): 1-8
毛远宏, 曹健, 贺鹏超, 刘曦, 柴波. 2023. 深度神经网络剪枝方法综述. 微电子学与计算机, 40(10): 1-8 [DOI: 10.19304/J.ISSN1000-7180.2023.0299]
Park W, Kim D, Lu Y and Cho M. 2019. Relational knowledge distillation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3962-3971 [DOI: 10.1109/CVPR.2019.00409]
Passalis N, Tzelepi M and Tefas A. 2021. Probabilistic knowledge transfer for lightweight deep representation learning. IEEE Transactions on Neural Networks and Learning Systems, 32(5): 2030-2039 [DOI: 10.1109/TNNLS.2020.2995884]
Peng B Y, Jin X, Li D S, Zhou S F, Wu Y C, Liu J H, Zhang Z N and Liu Y. 2019. Correlation congruence for knowledge distillation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 5006-5015 [DOI: 10.1109/iccv.2019.00511]
Romero A, Ballas N, Kahou S E, Chassang A, Gatta C and Bengio Y. 2015. FitNets: hints for thin deep nets [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1412.6550.pdf
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/cvpr.2018.00474]
Sengupta A, Ye Y T, Wang R, Liu C A and Roy K. 2019. Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience, 13: #95 [DOI: 10.3389/fnins.2019.00095]
Si Z F and Qi H G. 2023. Survey on knowledge distillation and its application. Journal of Image and Graphics, 28(9): 2817-2832
司兆峰, 齐洪钢. 2023. 知识蒸馏方法研究与应用综述. 中国图象图形学报, 28(9): 2817-2832 [DOI: 10.11834/jig.220273]
Sun H R, Wang W and Chen H B. 2020. Lightweight image compression neural network based on parameter quantization. Information Technology, 44(10): 87-91
孙浩然, 王伟, 陈海宝. 2020. 基于参数量化的轻量级图像压缩神经网络研究. 信息技术, 44(10): 87-91 [DOI: 10.13274/j.cnki.hdzj.2020.10.016]
Tian Y L, Krishnan D and Isola P. 2022. Contrastive representation distillation [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1910.10699.pdf
Tung F and Mori G. 2019. Similarity-preserving knowledge distillation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 1365-1374 [DOI: 10.1109/ICCV.2019.00145]
Wang D W, Liu B C, Han Z, Wang Y M and Tang Y D. 2024. Deep network compression method based on low-rank decomposition and vector quantization. Journal of Computer Applications, 44(7): 1987-1994
王东炜, 刘柏辰, 韩志, 王艳美, 唐延东. 2024. 基于低秩分解和向量量化的深度网络压缩方法. 计算机应用, 44(7): 1987-1994 [DOI: 10.11772/j.issn.1001-9081.2023071027]
Yang C G, Yu X Q, An Z L and Xu Y J. 2023a. Categories of response-based, feature-based, and relation-based knowledge distillation//Pedrycz W and Chen S M, eds. Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems. Cham: Springer: 1-32 [DOI: 10.1007/978-3-031-32095-8_1]
Yang J, Martinez B, Bulat A and Tzimiropoulos G. 2021. Knowledge distillation via softmax regression representation learning//Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria
Yang Z D, Zeng A L, Li Z, Zhang T K, Yuan C and Li Y. 2023b. From knowledge distillation to self-knowledge distillation: a unified approach with normalized loss and customized soft labels [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/2303.13005.pdf
You S, Xu C, Xu C and Tao D C. 2017. Learning from multiple teacher networks//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, Canada: ACM: 1285-1294 [DOI: 10.1145/3097983.3098135]
Zagoruyko S and Komodakis N. 2017a. Wide residual networks [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1605.07146.pdf
Zagoruyko S and Komodakis N. 2017b. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/1612.03928.pdf
Zhao B R, Cui Q, Song R J, Qiu Y Y and Liang J J. 2022. Decoupled knowledge distillation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11943-11952 [DOI: 10.1109/CVPR52688.2022.01165]
Zhou G R, Fan Y, Cui R P, Bian W J, Zhu X Q and Gai K. 2018. Rocket launching: a universal and efficient framework for training well-performing light net//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI: 4580-4587 [DOI: 10.1609/aaai.v32i1.11601]
Zhou H L, Song L C, Chen J J, Zhou Y, Wang G L, Yuan J S and Zhang Q. 2021. Rethinking soft labels for knowledge distillation: a bias-variance tradeoff perspective [EB/OL]. [2024-02-07]. https://arxiv.org/pdf/2102.00650.pdf