Multi-label classification of chest X-ray images with pre-trained vision Transformer model
2023, Vol. 28, No. 4: 1186-1197
Print publication date: 2023-04-16
DOI: 10.11834/jig.220284
Xing Suxia, Ju Zihan, Liu Zijiao, Wang Yu and Fan Fuqiang. 2023. Multi-label classification of chest X-ray images with pre-trained vision Transformer model. Journal of Image and Graphics, 28(4): 1186-1197
Objective
Computer-based disease detection and classification on chest X-ray images currently suffers from high misdiagnosis rates and low accuracy. Building on a pre-trained vision Transformer (ViT) model, this paper applies transfer learning to provide computer-aided diagnosis of chest X-ray images and to improve diagnostic accuracy and efficiency.
Method
A ViT model that incorporates a convolutional neural network (CNN), pre-trained on a very large-scale natural-image dataset, is adopted. The model structure is fine-tuned, the backbone network is initialized with the pre-trained ViT parameters, and the model is then transferred to chest X-ray image datasets for further training to perform multi-label disease classification.
Result
The average AUC (area under the ROC curve) scores of the ViT model before and after transfer learning are compared on the IU X-Ray dataset. The pre-trained ViT model achieves an average AUC of 0.774, an improvement of 0.208 over the model trained without transfer learning. Ablation experiments on model structure and data preprocessing are conducted, and the attention mechanism in ViT is visualized, further verifying the model's effectiveness. Finally, the fine-tuned ViT model is trained on the ChestX-Ray14 and CheXpert datasets, reaching average AUC scores of 0.839 and 0.806, improvements of 0.014 to 0.031 over the compared methods.
Conclusion
Compared with other methods, the ViT model classifies chest X-ray images with higher multi-label accuracy, and transfer learning improves the classification performance and generalization of the ViT model while reducing training cost. The ablation experiments and model visualization show that a ViT model containing a CNN structure focuses on meaningful regions and efficiently extracts visual features from chest X-ray images.
Objective
Chest X-ray screening and diagnosis are essential in modern radiology. However, the interpretation of chest X-ray images still depends heavily on clinical experience and is prone to misdiagnosis and missed diagnoses. Automatically detecting and identifying one or more potential diseases in an image with computer-based techniques can improve diagnostic efficiency and accuracy. Compared with natural images, chest X-ray images make it difficult to detect and distinguish multiple lesions accurately within a single image, because abnormal areas occupy a small proportion of the image and have complex appearances. Deep learning models based on convolutional neural networks (CNNs) have been widely used in medical imaging. The CNN convolution kernel is sensitive to local detail and can extract rich image features. However, a convolution kernel cannot capture global information, and the extracted features carry redundant information from the background, muscles, and bones, which degrades performance on multi-label classification tasks to a certain extent. Recently, the vision Transformer (ViT) model has achieved strong results in computer vision tasks. ViT can effectively capture information from multiple regions of an entire image simultaneously, but it requires training on large-scale datasets to perform well. Owing to factors such as patient privacy and manual annotation costs, the size of available chest X-ray image datasets is limited. To reduce the model's dependence on data scale and improve multi-label classification performance, we apply a CNN-based ViT pre-trained model with transfer learning to computer-aided diagnosis and multi-label classification of chest X-ray images.
Method
The CNN-based ViT model is pre-trained on a very large-scale natural-image dataset, which provides the initial parameters of the model. The model structure is then fine-tuned to the characteristics of chest X-ray datasets. A 1 × 1 convolution layer converts chest X-ray images from 1 channel to 3 channels. The number of output nodes of the linear layer in the classifier is changed from 1 000 to the number of chest X-ray classification labels, and Sigmoid is used as the activation function. The backbone network is initialized with the pre-trained ViT parameters and then trained on the chest X-ray dataset to complete multi-label classification. The experiments use Python 3.7 and PyTorch 1.8 to build the model and an RTX 3090 GPU for training, with the stochastic gradient descent (SGD) optimizer, the binary cross-entropy (BCE) loss function, an initial learning rate of 1E-3, and cosine-annealing learning-rate decay. For training, each image is scaled to 512 × 512 pixels, a 224 × 224 pixel area is randomly cropped as the model input, and data augmentation is applied by randomly flipping, perspective transformation, shearing, translation, zooming, and changing brightness. For testing, each chest X-ray image is scaled to 256 × 256 pixels and a 224 × 224 center crop is fed to the trained model.
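As illustration, a minimal PyTorch sketch of this adaptation might look as follows; the timm checkpoint name vit_base_r50_s16_224 (a hybrid ResNet50+ViT), the SGD momentum, and the scheduler horizon are assumptions for the sketch, not the paper's exact settings:

```python
# Minimal sketch, assuming the timm library's hybrid ResNet50+ViT checkpoint
# "vit_base_r50_s16_224" as a stand-in for the paper's CNN-based ViT.
import torch
import torch.nn as nn
import timm
from torchvision import transforms

NUM_LABELS = 14  # dataset-specific, e.g., 14 disease labels for ChestX-Ray14

class ChestXrayViT(nn.Module):
    def __init__(self, num_labels: int = NUM_LABELS):
        super().__init__()
        # 1 x 1 convolution maps the single-channel X-ray to the 3 channels
        # expected by the natural-image pre-trained backbone.
        self.channel_conv = nn.Conv2d(1, 3, kernel_size=1)
        # num_classes swaps the original 1 000-way head for num_labels outputs.
        self.backbone = timm.create_model(
            "vit_base_r50_s16_224", pretrained=True, num_classes=num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Raw logits, one per label; the Sigmoid is folded into the loss below.
        return self.backbone(self.channel_conv(x))

model = ChestXrayViT()
criterion = nn.BCEWithLogitsLoss()  # BCE loss with built-in Sigmoid
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Preprocessing as described in the text: 512 resize + random 224 crop with
# random augmentations for training; 256 resize + 224 center crop for testing.
train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomPerspective(distortion_scale=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05),
                            scale=(0.9, 1.1), shear=5),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```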
Result
The experiments are performed on IU X-Ray, a small-scale chest X-ray dataset. The model is evaluated quantitatively with the area under the ROC curve (AUC) averaged across all classification labels. The results show that the average AUC score of the pre-trained ViT model is 0.774, whereas the accuracy and training efficiency of the non-pre-trained ViT model drop significantly: its average AUC score reaches only 0.566, which is 0.208 lower. In addition, attention heat maps generated from the ViT model strengthen the interpretability of the model. A series of ablation experiments is carried out on data augmentation, model structure, and batch size. The fine-tuned ViT model is also trained on the ChestX-Ray14 and CheXpert datasets, where the average AUC scores reach 0.839 and 0.806, improvements of 0.014 and 0.031 over the compared methods.
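For reference, the reported metric (per-label AUC averaged over all labels) can be computed with scikit-learn as in the following sketch; the array names are illustrative:

```python
# Sketch of the evaluation metric: per-label AUC averaged across all labels.
# y_true: (N, L) binary ground-truth matrix; y_score: (N, L) predicted
# probabilities from the model's Sigmoid outputs.
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    per_label = [
        roc_auc_score(y_true[:, i], y_score[:, i])
        for i in range(y_true.shape[1])
        # AUC is undefined for a label with only one class in the split.
        if len(np.unique(y_true[:, i])) == 2
    ]
    return float(np.mean(per_label))
```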
Conclusion
A pre-trained ViT model is applied to the multi-label classification of chest X-ray images via transfer learning. The experimental results show that ViT has strong multi-label classification performance on chest X-ray images, and that its attention mechanism helps it focus precisely on lesions, such as those inside the chest cavity and the heart. Transfer learning improves the classification performance and generalization of ViT on small-scale datasets while greatly reducing training cost. The ablation experiments demonstrate that a model combining CNN and Transformer outperforms single-structure models, and that data augmentation and reducing the batch size can improve performance, although smaller batches also lengthen training time. To further improve the model, future research can focus on extracting complex disease information and high-level semantics, such as small lesions, disease location, and severity.
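The attention visualization referred to above can be reproduced in spirit with a generic ViT heat-map sketch such as the following, which averages the last block's [CLS]-to-patch attention over heads; this is a common visualization technique, not necessarily the paper's exact procedure, and the tensor layout assumed is the standard ViT one:

```python
# Generic ViT attention heat map: average the last Transformer block's
# [CLS]-to-patch attention over heads and upsample to image resolution.
import torch
import torch.nn.functional as F

def cls_attention_map(attn: torch.Tensor, grid: int = 14,
                      out_size: int = 224) -> torch.Tensor:
    """attn: (heads, tokens, tokens) attention weights of the last block,
    where token 0 is [CLS] and the remaining tokens form a grid x grid map."""
    cls_to_patches = attn.mean(dim=0)[0, 1:]           # (grid * grid,)
    heat = cls_to_patches.reshape(1, 1, grid, grid)    # back to a 2-D grid
    heat = F.interpolate(heat, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat[0, 0]  # (out_size, out_size) map in [0, 1] for overlay
```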
Keywords: chest X-ray images; multi-label classification; convolutional neural network (CNN); vision Transformer (ViT); transfer learning