Data-free knowledge distillation for target class classification
2024, Vol. 29, No. 11: 3401-3416
Received: 2023-12-12; Revised: 2024-02-29; Published in print: 2024-11-16
DOI: 10.11834/jig.230816
Objective
Most current research addresses the lack of training data with data-free distillation methods. However, in practical application scenarios, existing data-free distillation methods suffer from difficult model convergence and insufficient compactness of the student model. To meet the need to train models for only a subset of classes and to flexibly select target-class knowledge from the teacher network, this paper proposes a new data-free knowledge distillation method: masked distillation for target classes (MDTC).
Method
Building on a generator that learns the batch-normalization parameter distribution of the original data, MDTC uses a mask to block the backpropagation of non-target-class gradients during the generator's updates, thereby training a generator that synthesizes only target-class samples and accurately extracting the specific knowledge in the teacher model. In addition, MDTC introduces the teacher model into the feature learning of the generator's intermediate layers, optimizing the generator's initial parameter settings and its parameter update strategy to accelerate model convergence.
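To make the masking idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation; the function name masked_probs, the tensor shapes, and the renormalization choice are illustrative assumptions. Selecting only the target-class logits keeps the non-target columns out of the loss's computation graph, so no gradient flows back through them when the generator is updated.

```python
import torch
import torch.nn.functional as F

def masked_probs(logits: torch.Tensor, target_classes: torch.Tensor) -> torch.Tensor:
    """Keep only the target-class logits and renormalize over them.

    Non-target columns never enter the loss's computation graph, so the
    generator receives no gradient from them -- a sketch of the "mask blocks
    non-target gradient backpropagation" idea described above.
    """
    return F.softmax(logits[:, target_classes], dim=1)

# Usage sketch with hypothetical tensors:
# teacher_logits = teacher(fake_images)        # (B, num_classes)
# targets = torch.tensor([3, 5, 7])            # indices of the target classes
# p = masked_probs(teacher_logits, targets)    # (B, 3); gradients flow only via target logits
```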
Result
Thirteen subclassification tasks are designed on four standard image classification datasets to evaluate MDTC on subclassification tasks of varying difficulty. The experimental results show that MDTC extracts the specific knowledge of the teacher model accurately and efficiently: its overall accuracy is higher than that of mainstream data-free distillation models, and it requires less training time. More than 40% of the student models even exceed the accuracy of the teacher model, with a maximum improvement of 3.6%.
Conclusion
The overall performance of the proposed method exceeds that of existing data-free distillation models. It learns knowledge especially efficiently on easy classification tasks and performs best when the extracted target classes account for only a small proportion of all classes.
Objective
Knowledge distillation is a simple and effective method for compressing neural networks and has become a popular topic in model compression research. It adopts a “teacher–student” architecture in which a large network guides the training of a small network to improve the small network's performance in application scenarios, thereby indirectly achieving network compression. In traditional methods, training the student model relies on the teacher's training data, and the quality of the student model depends on the quality of that data. When data are scarce, these methods fail to produce satisfactory results. Data-free knowledge distillation addresses the problem of limited training data by introducing synthetic data. Such methods mainly synthesize training data by refining the knowledge of the teacher network, for example, by inverting images from the teacher's intermediate representations or by employing the teacher network as a fixed discriminator that supervises a generator of synthetic images for training the student network. Compared with traditional methods, data-free knowledge distillation does not rely on the teacher network's original training data, which markedly expands the application scope of knowledge distillation. However, its training can be less efficient than that of traditional methods because additional synthetic training data must be produced. Furthermore, practical applications often focus on only a few target classes, yet existing data-free knowledge distillation methods have difficulty selectively learning target-class knowledge: when the teacher model covers many classes, model convergence is difficult and the student model cannot be made sufficiently compact. Therefore, this paper proposes a novel data-free knowledge distillation method, namely masked distillation for target classes (MDTC). MDTC allows the student model to selectively learn the knowledge of target classes and maintains good performance even when the teacher network covers many classes. Compared with traditional methods, MDTC reduces the training difficulty and improves the training efficiency of data-free knowledge distillation.
Method
MDTC uses a generator that learns the batch-normalization parameter distribution of the raw data and trains it to generate only target-class samples by constructing a mask that blocks the gradient backpropagation of non-target classes during the generator's updates. In this way, MDTC extracts the target knowledge from the teacher model while generating synthetic data similar to the original data. In addition, MDTC introduces the teacher model into the feature learning process of the generator's middle layers, supervises the training of the generator, and optimizes the generator's initial parameter settings and parameter update strategy to accelerate model convergence. The MDTC algorithm consists of two stages. The first is the data synthesis stage, in which the student network is fixed and only the generator network is updated. During each generator update, MDTC extracts three synthetic samples from the shallow, middle, and deep layers of the generator, feeds them into the teacher network for prediction, and updates the generator parameters according to the teacher's feedback. When updating the shallow- and middle-layer parameters, the other layers of the generator are frozen and only the current layer is updated; when the output layer is finally updated, the parameters of the entire generator are updated, gradually guiding the generator to learn to synthesize target images. The second stage is the learning stage, in which the generator is fixed and the synthetic samples are fed into both the teacher and student networks for prediction. The teacher's target knowledge is extracted through the mask, and the Kullback-Leibler (KL) divergence with respect to the student network's predicted output is used to update the student network.
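The sketch below condenses one round of the two-stage procedure just described into PyTorch. It is an assumption-laden illustration rather than the paper's code: the generator objective is reduced to a single confidence term on the teacher's masked logits (the paper additionally aligns batch-normalization statistics and supervises the shallow, middle, and deep generator layers separately), the student is assumed to output logits only for the target classes, and all names (mdtc_round, g_opt, s_opt, T, z_dim) are hypothetical.

```python
import torch
import torch.nn.functional as F

def mdtc_round(generator, teacher, student, g_opt, s_opt,
               target_classes, z_dim=100, batch=64, T=4.0):
    device = next(teacher.parameters()).device
    target_classes = torch.as_tensor(target_classes, device=device)

    # Stage 1: data synthesis -- student fixed, generator updated.
    z = torch.randn(batch, z_dim, device=device)
    fake = generator(z)
    t_logits = teacher(fake)[:, target_classes]   # mask: only target-class logits
    pseudo = t_logits.argmax(dim=1)               # teacher's most confident target class
    g_loss = F.cross_entropy(t_logits, pseudo)    # sharpen synthesis toward target classes
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # Stage 2: knowledge transfer -- generator fixed, student updated via KL divergence.
    with torch.no_grad():
        fake = generator(torch.randn(batch, z_dim, device=device))
        t_prob = F.softmax(teacher(fake)[:, target_classes] / T, dim=1)
    s_log_prob = F.log_softmax(student(fake) / T, dim=1)  # student head: target classes only (assumption)
    kd_loss = F.kl_div(s_log_prob, t_prob, reduction="batchmean") * (T * T)
    s_opt.zero_grad(); kd_loss.backward(); s_opt.step()
    return g_loss.item(), kd_loss.item()
```

In a full training run this round would be repeated many times, with the layer-wise generator updates interleaved as described above.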
Result
Four standard image classification datasets, namely, MNIST, SVHN, CIFAR10, and CIFAR100, are divided into 13 subclassification tasks by Pearson similarity calculation, including eight difficult tasks and five easy tasks. The performance of MDTC on subclassification tasks of different difficulty is evaluated by classification accuracy, and the method is compared with five mainstream data-free knowledge distillation methods and the vanilla KD method. Experimental results show that the proposed method outperforms the other mainstream data-free distillation models on 11 subtasks. Moreover, on MNIST-1, MNIST-2, SVHN-1, SVHN-3, CIFAR10-2, and CIFAR10-4 (6 of the 13 subclassification tasks), the proposed method even surpasses the teacher model trained on the original data, achieving accuracy rates of 99.61%, 99.46%, 95.85%, 95.80%, 94.57%, and 95.00%, respectively, a remarkable 3.6% improvement over the 91.40% accuracy of the teacher network on CIFAR10-4.
Conclusion
In this study, a novel data-free knowledge distillation method, MDTC, is proposed. The experimental results indicate that MDTC outperforms existing data-free distillation models overall; it learns knowledge especially efficiently on easy classification tasks and performs best when the target classes to be extracted account for only a small proportion of all classes.