结合知识增强和特征对齐的胸片图像报告生成
Knowledge enhancement and feature alignment for chest radiology images report generation
- 2024, Pages: 1-16
Online publication date: 2024-10-01
DOI: 10.11834/jig.240536
符杰,刘骊,付晓东等.结合知识增强和特征对齐的胸片图像报告生成[J].中国图象图形学报,
FU Jie,LIU Li,FU Xiaodong,et al.Knowledge enhancement and feature alignment for chest radiology images report generation[J].Journal of Image and Graphics,
Objective
To address the imprecise representations, mismatched features, and inaccurate results in chest X-ray image report generation caused by the semantic gap between images and text, the complexity and diversity of disease categories, and biases and omissions in diagnostic reports, this paper proposes a chest X-ray image report generation method that combines knowledge enhancement and feature alignment.
Method
The method consists of three modules: image and text feature representation, knowledge-enhanced visual feature learning, and global-local feature alignment. First, chest X-ray images and text reports are taken as input, and an image and text feature representation module containing visual and textual encoders extracts global and local features from the images and text, respectively. Then, a chest prior knowledge graph is introduced, and knowledge-enhanced visual feature learning is performed through pathology-graph knowledge encoding, yielding fused, enhanced visual features. Finally, cross-attention is defined to perform cross-modal alignment between the global-local features of images and text and between visual features and disease labels, and a multi-head attention encoder-decoder generates accurate chest X-ray reports.
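To make the pathology-graph knowledge encoding step concrete, the following is a minimal PyTorch sketch under simplifying assumptions: the chest prior knowledge graph is reduced to a normalized adjacency matrix over a small set of disease and anatomy concepts, a single graph-convolution step encodes the node embeddings, and fusion with the visual features is done by cross-attention with a residual connection. The class name, dimensions, and the fusion choice are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class GraphKnowledgeEncoder(nn.Module):
    """Sketch: encode a chest prior knowledge graph with one graph-convolution
    step, then fuse the node embeddings into visual features via cross-attention.
    Sizes and the fusion choice are illustrative, not the authors' code."""
    def __init__(self, num_nodes=20, dim=512):
        super().__init__()
        self.node_emb = nn.Parameter(torch.randn(num_nodes, dim))  # concept embeddings
        self.gcn = nn.Linear(dim, dim)                             # shared propagation weight
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_feats, adj):
        # visual_feats: (B, N_patches, dim); adj: (num_nodes, num_nodes) normalized adjacency
        nodes = torch.relu(self.gcn(adj @ self.node_emb))           # one propagation step
        nodes = nodes.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        # visual patches attend to graph concepts; output is knowledge-enhanced
        enhanced, _ = self.fuse(query=visual_feats, key=nodes, value=nodes)
        return visual_feats + enhanced                               # residual fusion

# toy usage
enc = GraphKnowledgeEncoder()
adj = torch.eye(20)                       # placeholder adjacency (self-loops only)
v = torch.randn(2, 49, 512)               # e.g. 7x7 patch features from a visual backbone
print(enc(v, adj).shape)                  # torch.Size([2, 49, 512])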
Results
To verify the effectiveness of the method, comparative experiments were conducted on two challenging datasets, IU X-Ray and MIMIC-CXR. On IU X-Ray, the proposed method reaches BLEU-1, BLEU-3, and BLEU-4 scores of 0.505, 0.235, and 0.178, respectively, improving on most existing methods for this task; on MIMIC-CXR, the BLEU-2 and BLEU-3 scores are 0.4% and 1.2% higher than those of a range of other methods, indicating a clear advantage of the proposed method.
Conclusion
The proposed chest X-ray image report generation method captures fine-grained features of images and text, focuses on global-local features and the correlations among disease categories, improves the matching between images and text, and can generate complete and accurate medical reports.
Objective
Medical report generation leverages natural language processing and machine learning techniques to convert medical data, such as medical images, into structured text reports, aiming to improve efficiency in the medical field and reduce the workload of healthcare professionals. Annually, over 3.6 billion X-ray examinations are conducted worldwide, the majority of which are chest X-rays, and this number continues to rise. This trend places increasing pressure on radiologists, affecting both the quality and speed of clinical decision-making. Inspired by advanced general image captioning methods, current medical report generation methods can be broadly categorized into three types based on their implementation techniques: encoder-decoder methods, cross-modal alignment methods, and knowledge-enhanced methods. Among these, encoder-decoder methods, particularly those built on the Transformer architecture, are the most widely adopted. The Transformer excels at encoding long-range dependencies and learning efficient feature representations, making it well suited to medical report generation. However, most of these methods adopt an end-to-end approach that generates only diagnostic reports without classifying the types of chest diseases, and thus lack the assistance of disease label semantics. Despite significant progress in medical report generation driven by advances in deep learning, several challenges remain. First, there is a semantic gap between medical images and text reports. Most existing methods focus only on aligning prominent information in the images and text, often neglecting the interaction of fine-grained information, which hampers the association between local information in detailed image regions and the text reports. Second, the complexity and diversity of organs and diseases in chest X-ray images, along with special cases such as complications and coexisting diseases, make it difficult to generate accurate text reports. Additionally, key clinical information in medical report generation often derives from descriptions of abnormalities, and the absence of abnormalities in both the images and reports frequently leads models to generate normal and highly similar reports. To address these issues, this paper proposes a chest X-ray image report generation method that combines knowledge enhancement and feature alignment. The approach includes three main modules: first, a feature representation module extracts more detailed features from text reports and chest images; second, knowledge-enhanced visual feature learning incorporates prior medical knowledge to guide the learning of visual features; and third, global-local feature alignment promotes semantic alignment between images, reports, and disease labels, thereby improving the accuracy and completeness of the generated reports.
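As a rough orientation to how the three modules could fit together, the following sketch composes placeholder sub-modules into one forward pass: a visual encoder producing region features, a knowledge-enhancement step, cross-attention alignment with disease-label embeddings, and a Transformer decoder emitting report tokens. All module names, dimensions, and interfaces are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class ReportGenerator(nn.Module):
    """Illustrative composition of the three modules described above:
    (1) image/text feature representation, (2) knowledge-enhanced visual
    feature learning, (3) global-local alignment feeding a Transformer
    decoder. All sub-modules are placeholders."""
    def __init__(self, dim=512, vocab_size=5000):
        super().__init__()
        self.visual_encoder = nn.Linear(2048, dim)       # stands in for a CNN/ViT backbone
        self.knowledge_enhance = nn.Identity()           # stands in for graph knowledge fusion
        self.align = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=3)
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, region_feats, label_emb, report_tokens):
        v = self.visual_encoder(region_feats)             # (B, N, dim) local visual features
        v = self.knowledge_enhance(v)                      # knowledge-enhanced features
        aligned, _ = self.align(v, label_emb, label_emb)   # align visual and disease-label features
        tgt = self.word_emb(report_tokens)                 # (B, T, dim) report token embeddings
        h = self.decoder(tgt, memory=aligned)
        return self.out(h)                                 # next-word logits

model = ReportGenerator()
logits = model(torch.randn(2, 49, 2048), torch.randn(2, 14, 512),
               torch.randint(0, 5000, (2, 30)))
print(logits.shape)                                        # torch.Size([2, 30, 5000])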
Method
First, chest X-ray images and text reports are input, and an image and text feature representation module containing visual and textual encoders is constructed to extract global and local features from the images and text, respectively. Then, a chest prior knowledge graph is introduced, enabling knowledge-enhanced visual feature learning through pathology-graph knowledge encoding and yielding fused, enhanced visual features. Finally, cross-attention is defined to perform cross-modal alignment between the global-local features of images and text and between visual features and disease labels, and a multi-head attention encoder-decoder generates accurate chest X-ray image reports.
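The global-local cross-modal alignment can be pictured as each modality's global representation attending over the other modality's local features; the sketch below uses two multi-head cross-attention blocks and a cosine alignment loss as one plausible instantiation. The loss choice and tensor shapes are assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAlign(nn.Module):
    """Sketch of cross-modal global-local alignment: each modality's global
    vector attends over the other modality's local features, and the two
    pooled views are pulled together with a cosine alignment loss."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_global, img_local, txt_global, txt_local):
        # img_global/txt_global: (B, dim); img_local/txt_local: (B, N, dim)
        q_img = img_global.unsqueeze(1)                          # (B, 1, dim)
        q_txt = txt_global.unsqueeze(1)
        img_ctx, _ = self.img2txt(q_img, txt_local, txt_local)   # image query over report words
        txt_ctx, _ = self.txt2img(q_txt, img_local, img_local)   # text query over image regions
        img_ctx, txt_ctx = img_ctx.squeeze(1), txt_ctx.squeeze(1)
        align_loss = 1 - F.cosine_similarity(img_ctx, txt_ctx, dim=-1).mean()
        return img_ctx, txt_ctx, align_loss

align = GlobalLocalAlign()
_, _, loss = align(torch.randn(2, 512), torch.randn(2, 49, 512),
                   torch.randn(2, 512), torch.randn(2, 60, 512))
print(loss.item())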
Results
The effectiveness of the proposed method was validated through comparative experiments on two challenging datasets, IU X-Ray and MIMIC-CXR. On the IU X-Ray dataset, the BLEU-1, BLEU-3, and BLEU-4 scores reach 0.505, 0.235, and 0.178, respectively, an improvement over most existing methods for the same task. This indicates that the proposed model offers advantages in text fluency, focuses better on disease regions, and generates correct label information, which significantly improves the report metrics. The CIDEr and ROUGE-L scores are also comparable to those of other methods, demonstrating that knowledge enhancement and feature alignment positively affect the quality of the generated reports. On the MIMIC-CXR dataset, the BLEU-2 and BLEU-3 metrics increase by 0.4% and 1.2%, respectively, demonstrating the robustness of the model, which maintains its advantages even on more complex and diverse data. The CE metrics on MIMIC-CXR also improve, with precision reaching 0.428, recall 0.343, and F1 0.360, indicating the model's effectiveness in generating complete and consistent reports. Qualitative experiments show that the proposed method generates medical reports largely consistent with the reference reports. Moreover, for abnormal images, the model accurately locates the abnormal regions, and the generated reports include not only descriptions of the disease conditions but also descriptions of the lesions. The vocabulary used is professional and closely matches the ground-truth reports, demonstrating that knowledge enhancement contributes to the model's ability to generate professional and accurate text. Ablation experiments confirm that adding the image and text feature representation, knowledge enhancement, and feature alignment modules improves report generation over the base model. This suggests that incorporating external knowledge into the medical report generation model and fully exploiting the visual features of similar medical images enable the model to learn sufficient visual features of medical images during encoding, guiding it to generate more complete and accurate medical reports during decoding.
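For reference, the BLEU-n values reported above are standard n-gram overlap scores between generated and ground-truth reports; the snippet below shows how such scores can be computed on tokenized reports with NLTK. The example sentences are invented for illustration and are not drawn from IU X-Ray or MIMIC-CXR.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy tokenized reports (illustrative only).
references = [[["the", "heart", "size", "is", "normal", "and", "lungs", "are", "clear"]]]
hypotheses = [["the", "heart", "is", "normal", "and", "the", "lungs", "are", "clear"]]

smooth = SmoothingFunction().method1
for n in (1, 2, 3, 4):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")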
Conclusion
The chest X-ray image report generation method proposed in this paper captures detailed features of both images and text, focuses on the relationships between global-local features and disease categories, and enhances the alignment between images and text, enabling the generation of complete and accurate medical reports.
chest X-ray image report generation; global-local feature representation; knowledge enhancement; feature learning; feature alignment