Dual-encoder global-local cross-attention network for medical image segmentation
2024, Vol. 29, No. 11, Pages 3462-3475
Print publication date: 2024-11-16
DOI: 10.11834/jig.230705
Li He, Liu Jianjun, Xiao Liang. 2024. Dual-encoder global-local cross-attention network for medical image segmentation. Journal of Image and Graphics, 29(11):3462-3475
Objective
Among existing medical image segmentation algorithms, methods that combine convolutional neural networks (CNNs) with Transformers have become mainstream. However, these methods typically fail to effectively combine the local and global information extracted by the CNN and the Transformer. To address this problem, a dual-encoder global-local cross-attention network (DGLCANet) is proposed.
Method
DGLCANet is built on the UNet encoder-decoder architecture. First, a dual-encoder structure based on a CNN and a cross-shaped window Transformer (CSWin Transformer) is adopted to extract rich global context features and local texture features from the image. Second, a global-local cross-attention Transformer block is introduced into the CNN branch to correlate the information extracted by the two branches (a sketch of such a block is given below). Finally, to narrow the feature gap between the encoder and the decoder, a feature adaptation block is inserted into the original skip connections.
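To make the dual-branch interaction concrete, the following minimal PyTorch sketch shows one plausible form of a global-local cross-attention block, in which queries come from the CNN (local) branch and keys/values come from the Transformer (global) branch, so that every local position can attend to the whole image. The class name, tensor shapes, and residual fusion are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class GlobalLocalCrossAttention(nn.Module):
    """Hypothetical cross-attention: CNN (local) features query Transformer (global) tokens."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, local_feat: torch.Tensor, global_tokens: torch.Tensor):
        # local_feat: (B, C, H, W) from the CNN branch;
        # global_tokens: (B, N, C) from the CSWin Transformer branch.
        b, c, h, w = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(self.norm_q(q),
                           self.norm_kv(global_tokens),
                           self.norm_kv(global_tokens))
        out = out + q  # residual fusion keeps the original local cues
        return out.transpose(1, 2).reshape(b, c, h, w)

# Toy usage: 64-channel features on a 16 x 16 grid, 256 global tokens.
block = GlobalLocalCrossAttention(dim=64)
local = torch.randn(2, 64, 16, 16)
global_t = torch.randn(2, 256, 64)
print(block(local, global_t).shape)  # torch.Size([2, 64, 16, 16])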
Result
DGLCANet is compared experimentally with nine state-of-the-art segmentation algorithms on four public datasets, and its results improve on the intersection over union (IoU), Dice coefficient, accuracy (ACC), and recall metrics. Its IoU on the four datasets reaches 85.1%, 83.34%, 68.01%, and 85.63%, exceeding the classic UNet by 8.07%, 6.01%, 7.83%, and 3.87%, respectively.
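For reference, the two overlap metrics reported above follow their standard set-based definitions, where P denotes the predicted mask and G the ground truth:

\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}, \qquad
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}

Since Dice = 2 IoU / (1 + IoU), the Dice score of a prediction is always at least as large as its IoU.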
Conclusion
DGLCANet combines the advantages of CNN-based and Transformer-based methods, makes full use of the global and local information in the image, and achieves superior segmentation performance.
Objective
With the rapid advancement of medical imaging technology, medical image segmentation has become a popular topic in medical image processing and the subject of extensive study. It has a wide range of applications and considerable research value in medical research and practice: segmentation results allow physicians to determine the location, size, and shape of lesions, providing an accurate basis for diagnosis and treatment. In recent years, UNet, based on convolutional neural networks (CNNs), has become a baseline architecture for medical image segmentation. However, this architecture cannot effectively extract global context information because of the limited receptive field of CNNs. The Transformer addresses this problem but is limited in capturing local information. Therefore, hybrid CNN-Transformer networks built on the UNet architecture have gradually become popular. However, existing methods still have shortcomings. For example, they typically cannot effectively combine the global and local information extracted by the CNN and the Transformer. Moreover, although the original skip connection can recover some of the location information lost by target features during downsampling, it may fail to capture all fine-grained details, ultimately affecting the accuracy of the predicted segmentation. This paper proposes a dual-encoder global-local cross-attention network with CNN and Transformer (DGLCANet) to address these issues.
Method
First, a dual-encoder network that combines the advantages of CNNs and Transformers is adopted to extract rich local and global information from the images. In the encoder stage, the Transformer and CNN branches extract global and local information, respectively, and the computationally efficient CSWin Transformer is used in the Transformer branch to keep the model's calculation cost low. Next, a global-local cross-attention Transformer module is proposed to fully utilize the global and local information extracted by the two encoder branches. The core of this module is a cross-attention mechanism, which captures the correlation between global and local features by exchanging information between the two branches. Finally, a feature adaptation block is designed for the skip connections of DGLCANet to compensate for the shortcomings of the original skip connections. This block adaptively matches features between the encoder and decoder, reducing the feature gap between them and improving the adaptive capability of the model; it can also recover detailed positional information lost during encoder downsampling. Tests are performed on four public datasets: ISIC-2017, ISIC-2018, BUSI, and the 2018 Data Science Bowl. ISIC-2017 and ISIC-2018 are dermoscopy image datasets for melanoma detection, containing 2 000 and 2 596 images, respectively. The BUSI dataset, which contains 780 images, is a breast ultrasound dataset for detecting breast cancer. The 2018 Data Science Bowl dataset, which contains 670 images, is used for examining cell nuclei in different microscope images. All images are resized to 256 × 256 pixels and randomly divided into training and test sets at a ratio of 8∶2. DGLCANet is implemented in the PyTorch framework and trained on an NVIDIA GeForce RTX 3090Ti GPU with 24 GB of memory. In the experiments, the binary cross-entropy and Dice loss functions are mixed in proportion to construct a new loss function (a sketch of this training setup follows), and the Adam optimizer is employed with an initial learning rate of 0.001, a momentum parameter of 0.9, and a weight decay of 0.0001.
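A minimal sketch of this training setup in PyTorch follows. The equal 0.5/0.5 mixing weights are an assumption, since the abstract states only that the two losses are combined in proportion, and a one-layer `model` stands in for DGLCANet.

import torch
import torch.nn as nn

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Soft Dice loss for binary masks; logits and target are (B, 1, H, W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

bce = nn.BCEWithLogitsLoss()

def mixed_loss(logits, target, w_bce: float = 0.5, w_dice: float = 0.5):
    # Proportional mix of binary cross-entropy and Dice, as described;
    # the 0.5/0.5 weights are hypothetical.
    return w_bce * bce(logits, target) + w_dice * dice_loss(logits, target)

model = nn.Conv2d(3, 1, kernel_size=1)  # placeholder for DGLCANet
# Reported settings: Adam, lr 0.001, momentum (beta1) 0.9, weight decay 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-4)

# One toy training step on random 256 x 256 data.
logits = model(torch.randn(2, 3, 256, 256))
target = (torch.rand(2, 1, 256, 256) > 0.5).float()
mixed_loss(logits, target).backward()
optimizer.step()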
Result
In this study, four evaluation metrics, namely, intersection over union (IoU), Dice coefficient (Dice), accuracy, and recall, are used to evaluate the effectiveness of the proposed method; higher values indicate better segmentation. Experimental results show that on the four datasets, the Dice coefficient reaches 91.88%, 90.82%, 80.71%, and 92.25%, which is 5.87%, 5.37%, 4.65%, and 2.92% higher than that of the classic UNet, respectively. The proposed method also outperforms recent state-of-the-art methods. Furthermore, the visualized results demonstrate that the proposed method effectively predicts the boundary area of the image and distinguishes the lesion area from the normal area. Compared with other methods, it still achieves better segmentation results in the presence of interference factors such as brightness variation, producing predictions remarkably close to the ground truth. A series of ablation experiments also shows that each of the proposed components performs satisfactorily.
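As a concrete illustration of how the four metrics relate to the confusion matrix of a binary mask, the following standard sketch (not the paper's evaluation code) computes them with NumPy:

import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred and gt are boolean arrays of the same shape."""
    tp = np.logical_and(pred, gt).sum()      # foreground hit
    fp = np.logical_and(pred, ~gt).sum()     # false alarm
    fn = np.logical_and(~pred, gt).sum()     # miss
    tn = np.logical_and(~pred, ~gt).sum()    # background hit
    eps = 1e-9  # guard against empty masks
    iou = tp / (tp + fp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn + eps)
    return iou, dice, acc, recall

# Toy usage on random 256 x 256 masks.
pred = np.random.rand(256, 256) > 0.5
gt = np.random.rand(256, 256) > 0.5
print(segmentation_metrics(pred, gt))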
Conclusion
In this study, a dual-encoder medical image segmentation method that integrates a global-local cross-attention mechanism is proposed. The experimental results demonstrate that the proposed method not only improves segmentation accuracy but also obtains satisfactory results on complex medical images. Future work will focus on further optimization and in-depth research to promote the practical application of this method and to support further breakthroughs in medical image segmentation.
Keywords: medical image segmentation; convolutional neural network (CNN); dual-encoder; cross-attention mechanism; Transformer
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2022. Swin-Unet: Unet-like pure Transformer for medical image segmentation//Proceedings of 2022 European Conference on Computer Vision. Tel Aviv, Israel: Springer: 205-218 [DOI: 10.1007/978-3-031-25066-8_9]
Chen J N, Lu Y Y, Yu Q H, Luo X D, Adeli E, Wang Y, Lu L, Yuille A L and Zhou Y Y. 2021. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. [2023-09-18]. https://arxiv.org/pdf/2102.04306.pdf
Chen L C, Papandreou G, Kokkinos I, Murphy K and Yuille A L. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Cheng J L, Tian S W, Yu L, Liu S J, Wang C Q, Ren Y, Lu H C and Zhu M. 2022. DDU-Net: a dual dense U-structure network for medical image segmentation. Applied Soft Computing, 126: #109297 [DOI: 10.1016/j.asoc.2022.109297]
Diakogiannis F I, Waldner F, Caccetta P and Wu C. 2020. ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data. ISPRS Journal of Photogrammetry and Remote Sensing, 162: 94-114 [DOI: 10.1016/j.isprsjprs.2020.01.013]
Dong X Y, Bao J M, Chen D D, Zhang W M, Yu N H, Yuan L, Chen D and Guo B N. 2022. CSWin Transformer: a general vision Transformer backbone with cross-shaped windows//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 12114-12124 [DOI: 10.1109/cvpr52688.2022.01181]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale//Proceedings of the 9th International Conference on Learning Representations. [s.l.]: OpenReview.net
Gudhe N R, Behravan H, Sudah M, Okuma H, Vanninen R, Kosma V M and Mannermaa A. 2021. Multi-level dilated residual network for biomedical image segmentation. Scientific Reports, 11(1): #14105 [DOI: 10.1038/s41598-021-93169-w]
Guo R H, Niu D T, Qu L and Li Z B. 2021. SOTR: segmenting objects with Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7137-7146 [DOI: 10.1109/iccv48922.2021.00707]
Hatamizadeh A, Tang Y C, Nath V, Yang D, Myronenko A, Landman B, Roth H R and Xu D G. 2022. UNETR: Transformers for 3D medical image segmentation//Proceedings of 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 1748-1758 [DOI: 10.1109/wacv51458.2022.00181]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Heidari M, Kazerouni A, Soltany M, Azad R, Aghdam E K, Cohen-Adad J and Merhof D. 2023. HiFormer: hierarchical multi-scale representations using Transformers for medical image segmentation//Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 6191-6201 [DOI: 10.1109/wacv56688.2023.00614]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/cvpr.2018.00745]
Huang G, Liu Z, van der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2261-2269 [DOI: 10.1109/CVPR.2017.243]
Huang H M, Lin L F, Tong R F, Hu H J, Zhang Q W, Iwamoto Y, Han X H, Chen Y W and Wu J. 2020. UNet 3+: a full-scale connected UNet for medical image segmentation//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE: 1055-1059 [DOI: 10.1109/ICASSP40776.2020.9053405]
Ibtehaz N and Rahman M S. 2020. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks, 121: 74-87 [DOI: 10.1016/j.neunet.2019.08.025]
Isensee F, Jaeger P F, Kohl S A A, Petersen J and Maier-Hein K H. 2021. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2): 203-211 [DOI: 10.1038/s41592-020-01008-z]
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Lin A L, Chen B Z, Xu J Y, Zhang Z, Lu G M and Zhang D. 2022. DS-TransUNet: dual Swin Transformer U-Net for medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 71: 1-15 [DOI: 10.1109/tim.2022.3178991]
Liu J J, Wu Z B, Xiao L and Wu X J. 2022. Model inspired autoencoder for unsupervised hyperspectral image super-resolution. IEEE Transactions on Geoscience and Remote Sensing, 60: #5522412 [DOI: 10.1109/tgrs.2022.3143156]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE Computer Society: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Milletari F, Navab N and Ahmadi S A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation//Proceedings of the 4th International Conference on 3D Vision (3DV). Stanford, USA: IEEE: 565-571 [DOI: 10.1109/3dv.2016.79]
Oktay O, Schlemper J, Le Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla N Y, Kainz B, Glocker B and Rueckert D. 2018. Attention U-Net: learning where to look for the pancreas [EB/OL]. [2023-10-07]. https://arxiv.org/pdf/1804.03999.pdf
Ramesh K K D, Kumar G K, Swapna K, Datta D and Rajest S S. 2021. A review of medical image segmentation algorithms. EAI Endorsed Transactions on Pervasive Health and Technology, 7(27): #6 [DOI: 10.4108/eai.12-4-2021.169184]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Valanarasu J M J and Patel V M. 2022. UNeXt: MLP-based rapid medical image segmentation network//Proceedings of the 25th International Conference on Medical Image Computing and Computer Assisted Intervention. Singapore: Springer: 23-33 [DOI: 10.1007/978-3-031-16443-9_3]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010 [DOI: 10.5555/3295222.3295349]
Wang J W, Tian S W, Yu L, Wang Y T, Wang F and Zhou Z C. 2022a. SBDF-Net: a versatile dual-branch fusion network for medical image segmentation. Biomedical Signal Processing and Control, 78: #103928 [DOI: 10.1016/j.bspc.2022.103928]
Wang R S, Lei T, Cui R X, Zhang B T, Meng H Y and Nandi A K. 2022b. Medical image segmentation using deep learning: a survey. IET Image Processing, 16(5): 1243-1267 [DOI: 10.1049/ipr2.12419]
Wang S F, Liu Y K, Sun Y F and Yin B C. 2023. SACNet: shuffling atrous convolutional U-Net for medical image segmentation. IET Image Processing, 17(4): 1236-1252 [DOI: 10.1049/ipr2.12709]
Wu H S, Chen S H, Chen G L, Wang W, Lei B Y and Wen Z K. 2022a. FAT-Net: feature adaptive Transformers for automated skin lesion segmentation. Medical Image Analysis, 76: #102327 [DOI: 10.1016/j.media.2021.102327]
Wu Y L, Wang G L, Wang Z Y, Wang H R and Li Y. 2022b. DI-Unet: dimensional interaction self-attention for medical image segmentation. Biomedical Signal Processing and Control, 78: #103896 [DOI: 10.1016/j.bspc.2022.103896]
Xie Y T, Zhang J P, Shen C H and Xia Y. 2021. CoTr: efficiently bridging CNN and Transformer for 3D medical image segmentation//Proceedings of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention. Strasbourg, France: Springer: 171-180 [DOI: 10.1007/978-3-030-87199-4_16]
Xu G P, Zhang X, He X W and Wu X R. 2023. LeViT-UNet: make faster encoders with Transformer for medical image segmentation//Proceedings of the 6th Chinese Conference on Pattern Recognition and Computer Vision. Xiamen, China: Springer: 42-53 [DOI: 10.1007/978-981-99-8543-2_4]
Zhang X F, Zhang S, Zhang D H and Liu R. 2023. Group attention-based medical image segmentation model. Journal of Image and Graphics, 28(10): 3231-3242 [DOI: 10.11834/jig.220748]
Zhou T, Dong Y L, Huo B Q, Liu S and Ma Z J. 2021. U-Net and its applications in medical image segmentation: a review. Journal of Image and Graphics, 26(9): 2058-2077 [DOI: 10.11834/jig.200704]
Zhou Z W, Siddiquee M M R, Tajbakhsh N and Liang J M. 2020. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging, 39(6): 1856-1867 [DOI: 10.1109/TMI.2019.2959609]
Zhu X Z, Su W J, Lu L W, Li B, Wang X G and Dai J F. 2021. Deformable DETR: deformable Transformers for end-to-end object detection//Proceedings of the 9th International Conference on Learning Representations. [s.l.]: OpenReview.net