Data-free knowledge distillation for target class classification
2024, Vol. 29, No. 11, Pages 3401-3416
Print publication date: 2024-11-16
DOI: 10.11834/jig.230816
Xie Yitao, Su Lumei, Yang Fan, Chen Yuhan. 2024. Data-free knowledge distillation for target class classification. Journal of Image and Graphics, 29(11):3401-3416
Objective
Researchers currently rely mostly on data-free distillation to cope with the lack of training data. However, in practical application scenarios, existing data-free distillation methods suffer from difficult model convergence and insufficient compactness of the student model. To meet the need to train models for only a subset of classes and to flexibly select the teacher network's knowledge of target classes, this paper proposes a new data-free knowledge distillation method: masked distillation for target classes (MDTC).
Method
Building on a generator that learns the batch-normalization parameter distribution of the original data, MDTC uses a mask to block the backpropagation of non-target-class gradients during the generator's gradient updates, so that the trained generator synthesizes samples of the target classes only, which enables accurate extraction of the specific knowledge in the teacher model. In addition, MDTC introduces the teacher model into the feature learning of the generator's intermediate layers and optimizes the generator's initial parameter settings and parameter update strategy to accelerate model convergence.
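The batch-normalization matching and class-mask ideas named above can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the helper names (`bn_statistics_loss`, `masked_target_loss`), the pseudo-label choice, and the restriction to `BatchNorm2d` layers are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bn_statistics_loss(teacher: nn.Module, synthetic: torch.Tensor) -> torch.Tensor:
    """Match batch statistics of the synthetic images to the BatchNorm running
    statistics stored in the frozen teacher (illustrative regularizer)."""
    losses, handles = [], []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            mean = x.mean(dim=[0, 2, 3])
            var = x.var(dim=[0, 2, 3], unbiased=False)
            losses.append(F.mse_loss(mean, bn.running_mean) +
                          F.mse_loss(var, bn.running_var))
        return hook

    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            handles.append(m.register_forward_hook(make_hook(m)))
    teacher(synthetic)                     # hooks collect per-layer statistic losses
    for h in handles:
        h.remove()
    return torch.stack(losses).sum()

def masked_target_loss(teacher_logits: torch.Tensor, target_classes) -> torch.Tensor:
    """Keep only target-class logits: non-target columns never enter the loss,
    so no gradient for them flows back into the generator."""
    mask = torch.zeros(teacher_logits.size(1), dtype=torch.bool,
                       device=teacher_logits.device)
    mask[list(target_classes)] = True
    kept = teacher_logits[:, mask]
    pseudo = kept.argmax(dim=1).detach()   # most confident target class as pseudo-label
    return F.cross_entropy(kept, pseudo)
```

In a full pipeline the two terms would presumably be weighted and summed into the generator's objective, so that only the target-class logits and the teacher's stored batch-normalization statistics drive the generator's updates.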
Result
Thirteen sub-classification tasks are designed on four standard image classification datasets to evaluate the performance of MDTC on sub-classification tasks of varying difficulty. Experimental results show that MDTC extracts the specific knowledge in the teacher model accurately and efficiently: its overall accuracy exceeds that of mainstream data-free distillation models while requiring less training time. More than 40% of the student models even surpass the teacher model in accuracy, with a maximum improvement of 3.6%.
Conclusion
The overall performance of the proposed method surpasses that of existing data-free distillation models. Its knowledge learning is especially efficient on easy sample classification tasks, and the model performs best when the extracted knowledge classes account for a small proportion of all classes.
Objective
Knowledge distillation is a simple and effective method for compressing neural networks and has become a popular topic in model compression research. It features a “teacher–student” architecture in which a large network guides the training of a small network to improve its performance in application scenarios, thereby indirectly achieving network compression. In traditional methods, the training of the student model relies on the training data of the teacher, and the quality of the student model depends on the quality of those data. When faced with data scarcity, these methods fail to produce satisfactory results. Data-free knowledge distillation addresses the issue of limited training data by introducing synthetic data. Such methods mainly synthesize training data by distilling knowledge from the teacher network; for example, they use the intermediate representations of the teacher network for image inversion synthesis, or they employ the teacher network as a fixed discriminator to supervise a generator that synthesizes the images used to train the student network. Compared with traditional methods, data-free knowledge distillation does not rely on the original training data of the teacher network, which markedly expands the application scope of knowledge distillation. However, because additional training data must be synthesized, training can be less efficient than in traditional methods. Furthermore, practical applications often focus on only a few target classes, and existing data-free knowledge distillation methods struggle to learn the knowledge of target classes selectively; in particular, when the teacher model covers many classes, convergence becomes difficult and the student model cannot be made sufficiently compact. Therefore, this paper proposes a novel data-free knowledge distillation method, namely masked distillation for target classes (MDTC). This method allows the student model to selectively learn the knowledge of target classes and maintains good performance even when the teacher network contains numerous classes. Compared with traditional methods, MDTC reduces the training difficulty and improves the training efficiency of data-free knowledge distillation.
Method
MDTC uses a generator to learn the batch-normalization parameter distribution of the raw data and trains it to generate only target-class samples by creating a mask that blocks the gradient backpropagation of non-target classes during the generator's gradient updates. In this way, the method extracts the target knowledge from the teacher model while generating synthetic data similar to the original data. In addition, MDTC introduces the teacher model into the feature learning process of the generator's intermediate layers, supervises the training of the generator, and optimizes the generator's initial parameter settings and parameter update strategies to accelerate model convergence. The MDTC algorithm is divided into two stages. The first is the data synthesis stage, in which the student network is fixed and only the generator network is updated. During generator updates, MDTC extracts three synthetic samples from the shallow, middle, and deep layers of the generator, inputs them into the teacher network for prediction, and updates the generator's parameters according to the teacher's feedback. When the shallow- or middle-layer parameters are updated, the other layers of the generator are frozen so that each layer is updated separately; finally, when the output layer is updated, the parameters of the entire generator are updated, gradually guiding the generator to learn to synthesize the target images. The second stage is the learning stage, in which the generator is fixed and the synthetic samples are input into the teacher and student networks for prediction. The teacher's target knowledge is extracted by the mask, and the Kullback-Leibler (KL) divergence between the masked teacher output and the student's predicted output is used to update the student network.
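As a rough illustration of this two-stage procedure, the PyTorch sketch below assumes a hypothetical generator whose forward pass returns images decoded from its shallow, middle, and deep layers, reuses the `masked_target_loss` helper from the earlier sketch, and collapses the paper's layer-wise freezing schedule into a single joint generator update. The names (`train_one_round`, `g_opt`, `s_opt`, the temperature `T`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_kl(student_logits, teacher_logits, target_classes, T: float = 4.0):
    """KL divergence between student and teacher, restricted to the target classes."""
    mask = torch.zeros(teacher_logits.size(1), dtype=torch.bool,
                       device=teacher_logits.device)
    mask[list(target_classes)] = True
    p_teacher = F.softmax(teacher_logits[:, mask] / T, dim=1)
    log_p_student = F.log_softmax(student_logits[:, mask] / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def train_one_round(generator, teacher, student, g_opt, s_opt,
                    z_dim, target_classes, batch_size=64, student_steps=5):
    device = next(teacher.parameters()).device
    teacher.eval()
    for p in teacher.parameters():        # the teacher stays frozen throughout
        p.requires_grad_(False)

    # Stage 1: data synthesis -- student fixed, generator updated. The generator is
    # assumed to return images decoded from its shallow, middle, and deep layers.
    z = torch.randn(batch_size, z_dim, device=device)
    g_loss = sum(masked_target_loss(teacher(img), target_classes)
                 for img in generator(z))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Stage 2: knowledge learning -- generator fixed, student updated with the
    # masked KL divergence between teacher and student predictions.
    for _ in range(student_steps):
        with torch.no_grad():
            z = torch.randn(batch_size, z_dim, device=device)
            x = generator(z)[-1]          # deepest (final) synthetic images
            t_logits = teacher(x)
        s_logits = student(x)
        s_loss = masked_kl(s_logits, t_logits, target_classes)
        s_opt.zero_grad()
        s_loss.backward()
        s_opt.step()
```

Restricting both the softened teacher distribution and the student's log-probabilities to the masked target columns keeps the distillation signal confined to the selected classes.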
Result
Four standard image classification datasets, namely, MNIST, SVHN, CIFAR10, and CIFAR100, are divided into 13 subclassification tasks by Pearson similarity calculation, including eight difficult tasks and five easy tasks. The performance of MDTC on subclassification tasks with different difficulties is evaluated by classification accuracy. The method is also compared with five mainstream data-free knowledge distillation methods and the vanilla KD method. Experimental results show that the proposed method outperforms the other mainstream data-free distillation models on 11 subtasks. Moreover, in MNIST-1, MNIST-2, SVHN-1, SVHN-3, CIFAR10-2, and CIFAR10-4 (6 of the 13 subclassification tasks), the proposed method even surpasses the teacher model trained on the original data, achieving accuracy rates of 99.61%, 99.46%, 95.85%, 95.80%, 94.57%, and 95.00%, demonstrating a remarkable 3.6% improvement over the 91.40% accuracy of the teacher network in CIFAR10-4.
Conclusion
In this study, a novel data-free knowledge distillation method, MDTC, is proposed. The experimental results indicate that MDTC outperforms existing data-free distillation models overall; it learns knowledge especially efficiently on easy sample classification tasks and performs best when the extracted target classes account for only a small proportion of the teacher's classes.
deep learning; image classification; model compression; data-free knowledge distillation; generators