Segmentation of abdominal CT and cardiac MR images with multi scale visual attention
Vol. 29, Issue 1, Pages: 268-279 (2024)
Published: 16 January 2024
DOI: 10.11834/jig.221032
Jiang Ting, Li Xiaoning. 2024. Segmentation of abdominal CT and cardiac MR images with multi scale visual attention. Journal of Image and Graphics, 29(01):0268-0279
Objective
Medical image segmentation is an important step in computer-aided diagnosis and surgical planning, but the complex structure of human organs and blurred tissue edges leave its accuracy in need of improvement. Having achieved success in computer vision, the vision Transformer (ViT) has been favored by medical image segmentation researchers. However, ViT-based segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images, and ViT requires considerable computational overhead.
Method
To address these problems, this paper proposes MSVA-TransUNet, a U-shaped network that takes the Transformer as its backbone and is built on multi-scale visual attention (MSVA). MSVA is an attention mechanism implemented with multiple strip convolutions: one pair of strip convolutions approximates the operation of a large-kernel convolution, and different strip-convolution pairs approximate different large-kernel convolutions, capturing image information at different scales.
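As a concrete illustration of this mechanism, below is a minimal PyTorch sketch of a strip-convolution attention block. The 5 × 5 local convolution and branch kernel sizes (7, 11, 21) follow the multi-scale convolutional attention of SegNeXt (Guo et al., 2022a) and are assumptions for illustration; the exact configuration of MSVA in MSVA-TransUNet may differ.

import torch
import torch.nn as nn

class MultiScaleVisualAttention(nn.Module):
    # Attention from strip convolutions: each depthwise (1, k) + (k, 1) pair
    # approximates a k x k large-kernel convolution at a fraction of the cost;
    # summing branches with different k gathers context at several scales.
    # Kernel sizes follow SegNeXt's MSCA and are illustrative assumptions.
    def __init__(self, dim, kernel_sizes=(7, 11, 21)):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # local context
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(dim, dim, 1)  # mix channels into attention weights

    def forward(self, x):
        attn = self.local(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        return self.proj(attn) * x  # reweight the input: attention, not plain conv

# Usage: a (B, C, H, W) feature map keeps its shape.
feat = torch.randn(1, 64, 56, 56)
assert MultiScaleVisualAttention(64)(feat).shape == feat.shape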
Result
Experiments on the abdominal multi-organ segmentation and cardiac segmentation datasets show that, compared with the baseline model, the proposed network improves the average Dice score by 3.74% and 1.58%, respectively; its attention requires 1/278 of the floating-point operations of the multi-head attention mechanism, and its 15.31 M parameters are 1/6.88 of TransUNet's.
Conclusion
The proposed network rivals the advanced TransUNet and Swin-UNet: replacing multi-head attention with multi-scale visual attention reduces computational overhead while retaining an advantage in segmentation performance. The code is available at: https://github.com/BeautySilly/VA-TransUNet.
Objective
Medical image segmentation is one of the important steps in computer-aided diagnosis and surgery planning. However, because human organs are complex and diverse in structure and vary in size, and tissue edges are blurred, segmentation performance remains poor and needs further improvement; more accurate segmentation can more effectively help doctors deliver treatment and advice. Recently, deep-learning-based methods have become a research hot spot in medical image segmentation. The Transformer, which achieved great success in natural language processing, has also flourished in computer vision as the vision Transformer (ViT) and is therefore favored by medical image segmentation researchers. However, current ViT-based medical image segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images and the spatial relations it encodes. Moreover, the quadratic computational complexity of ViT's multi-head self-attention (MHSA) mechanism increases the required computational overhead.
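For reference, the standard per-layer operation counts behind the quadratic-complexity remark (our annotation, not a formula from the paper), for $N$ tokens of channel width $C$ and strip length $k$:

$\mathrm{FLOPs}_{\mathrm{MHSA}} \approx 4NC^{2} + 2N^{2}C$ (quadratic in $N$)
$\mathrm{FLOPs}_{\mathrm{strip\ pair,\ depthwise}} \approx 2NkC$ (linear in $N$)

At typical feature-map resolutions $N$ is in the thousands while $k$ is on the order of ten, which is the scale of the savings quantified in the Result section.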
Method
To address the above problems, this paper proposes MSVA-TransUNet, a U-shaped network with the Transformer as its backbone, built on multi-scale visual attention (MSVA), an attention mechanism implemented with multiple strip convolutions. The mechanism plays a role similar to multi-head attention but uses convolutional operations to capture long-distance dependencies. First, the network extracts features with convolution kernels of different sizes: a pair of strip convolutions approximates one large-kernel convolution, and strip convolutions of different sizes approximate different large-kernel convolutions. The convolutions capture local information, while the approximated large kernels also learn long-distance dependencies in the image. Second, strip convolutions are lightweight, which remarkably reduces the number of parameters and floating-point operations of the network and lowers the required computational overhead, because the cost of convolution is far smaller than the quadratic cost of multi-head attention. Furthermore, this design avoids converting the image into a 1D sequence for input to the vision Transformer and makes full use of the 2D structure of the image when learning its features. Finally, the first patch embedding in the encoding stage is replaced with a convolution stem, which avoids directly converting a low channel count to a high one, a jump that runs counter to the typical design of convolutional neural networks (CNNs), while the patch embeddings elsewhere are retained.
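The convolution stem mentioned above can be sketched as follows, under stated assumptions: a stack of stride-2 3 × 3 convolutions that raises the channel count gradually, in the spirit of Xiao et al. (2021). The depth and channel schedule here are illustrative, not the paper's exact stem.

import torch
import torch.nn as nn

def conv_stem(in_ch=3, out_ch=64):
    # Gradual 3x3/stride-2 stem replacing the first patch embedding.
    # Channels grow stepwise (in -> out/4 -> out/2 -> out) instead of jumping
    # from a low channel count straight to the embedding width, matching the
    # usual CNN design. This channel schedule is an assumption for illustration.
    chans = [in_ch, out_ch // 4, out_ch // 2, out_ch]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# Three stride-2 stages: a 224x224 input becomes a 28x28 feature map, the same
# 8x reduction a single 8x8 patch embedding would give, but reached gradually.
x = torch.randn(1, 3, 224, 224)
print(conv_stem()(x).shape)  # torch.Size([1, 64, 28, 28])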
Result
Experimental results on the abdominal multi-organ segmentation dataset (covering eight organs) and the cardiac segmentation dataset (covering three cardiac structures) show that the segmentation accuracy of the proposed network improves over the baseline model: the average Dice score increases by 3.74% on abdominal multi-organ segmentation and by 1.58% on cardiac segmentation. The floating-point operations and parameter counts are also reduced compared with the MHSA mechanism and large-kernel convolution: the multi-scale visual attention requires 1/278 of the floating-point operations of the MHSA mechanism, and the network has 15.31 M parameters, 1/6.88 of TransUNet's.
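As a quick consistency check on the reported parameter ratio (our arithmetic, assuming TransUNet's commonly reported size of roughly 105 M parameters):

$15.31\,\mathrm{M} \times 6.88 \approx 105.3\,\mathrm{M}$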
Conclusion
Experimental results show that the proposed network is comparable to, and in places exceeds, current state-of-the-art networks. Multi-scale visual attention is used instead of multi-head self-attention; it likewise captures long-range relationships and extracts long-range image features, improving segmentation performance while reducing computational overhead, so the proposed network exhibits clear advantages. However, because some organs are small and specific in location, the network's feature-learning ability for them is insufficient and their segmentation accuracy still needs further improvement; we will continue to study in depth how to improve the segmentation of these organs. The code of this paper will be open-sourced at: https://github.com/BeautySilly/VA-TransUNet.
medical image segmentation; visual attention; Transformer; attention mechanism; abdominal multi-organ segmentation; cardiac segmentation
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2021. Swin-UNet: unet-like pure Transformer for medical image segmentation [EB/OL]. [2022-09-10]. http://arxiv.org/pdf/2105.05537.pdf
Chen J, Frey E C, He Y F, Segars W P, Li Y and Du Y. 2022. TransMorph: Transformer for unsupervised medical image registration. Medical Image Analysis, 82: #102615 [DOI: 10.1016/j.media.2022.102615]
Chen J N, Lu Y Y, Yu Q H, Luo X D, Adeli E, Wang Y, Lu L, Yuille A L and Zhou Y Y. 2021b. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2102.04306.pdf
Chen J Y, He Y F, Frey E C, Li Y and Du Y. 2021a. ViT-V-Net: vision Transformer for unsupervised volumetric medical image registration [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2104.06468.pdf
Chen L C, Papandreou G, Kokkinos I, Murphy K P and Yuille A L. 2018. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848 [DOI: 10.1109/TPAMI.2017.2699184]
Chu X X, Zhang B, Tian Z, Wei X L and Xia H X. 2023. Do we really need explicit position encodings for vision Transformers? [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2102.10882v1.pdf
Ding X H, Zhang X Y, Zhou Y Z, Han J G, Ding G G and Sun J. 2022. Scaling up your kernels to 31 × 31: revisiting large kernel design in CNNs//Proceedings of 2022 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 11953-11965 [DOI: 10.1109/CVPR52688.2022.01166]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2010.11929.pdf
Feng X, Tustison N J, Patel S H and Meyer C H. 2020. Brain tumor segmentation using an ensemble of 3D U-nets and overall survival prediction using radiomic features [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/1812.01049.pdf
Gaál G A, Maga B A and Lukács A. 2020. Attention U-Net based adversarial architectures for chest X-ray lung segmentation [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2003.10304.pdf
Guo M H, Lu C Z, Hou Q B, Liu Z N, Cheng M M and Hu S M. 2022a. SegNeXt: rethinking convolutional attention design for semantic segmentation [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2209.08575.pdf
Guo M H, Lu C Z, Liu Z N, Cheng M M and Hu S M. 2022b. Visual attention network [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2202.09741.pdf
Hu J, Shen L, Albanie S, Sun G and Vedaldi A. 2019. Gather-Excite: exploiting feature context in convolutional neural networks [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/1810.12348.pdf
Huang H M, Lin L F, Tong R F, Hu H J, Zhang Q W, Iwamoto Y, Han X H, Chen Y W and Wu J. 2020. UNet 3+: a full-scale connected UNet for medical image segmentation//Proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 1055-1059 [DOI: 10.1109/ICASSP40776.2020.9053405]
Isensee F, Jaeger P F, Kohl S A A, Petersen J and Maier-Hein K H. 2021. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2): 203-211 [DOI: 10.1038/s41592-020-01008-z]
Li C Y, Yang J W, Zhang P C, Gao M, Xiao B, Dai X Y, Yuan L and Gao J F. 2022. Efficient self-supervised vision Transformers for representation learning [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2106.09785.pdf
Li X M, Chen H, Qi X J, Dou Q, Fu C W and Heng P A. 2018. H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging, 37(12): 2663-2674 [DOI: 10.1109/TMI.2018.2845918]
Li Y W, Zhang K, Cao J Z, Timofte R and van Gool L. 2021. LocalViT: bringing locality to vision Transformers [EB/OL]. [2022-09-10]. https://arxiv.org/pdf/2104.05707.pdf
Litjens G, Kooi T, Bejnordi B E, Setio A A A, Ciompi F, Ghafoorian M, Van Der Laak J A W M, Van Ginneken B and Sánchez C I. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis, 42: 60-88 [DOI: 10.1016/j.media.2017.07.005]
Liu H X, Dai Z H, So D R and Le Q V. 2021a. Pay attention to MLPs [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/2105.08050.pdf
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021b. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Milletari F, Navab N and Ahmadi S A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation//Proceedings of the 4th International Conference on 3D Vision (3DV). Stanford, USA: IEEE: 565-571 [DOI: 10.1109/3DV.2016.79]
Park J, Woo S, Lee J Y and Kweon I S. 2018. BAM: bottleneck attention module [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/1807.06514.pdf
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/1505.04597.pdf
Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X H, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M and Dosovitskiy A. 2021. MLP-Mixer: an all-MLP architecture for vision [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/2105.01601.pdf
Trockman A and Kolter J Z. 2022. Patches are all you need? [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/2201.09792.pdf
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/1706.03762.pdf
Wu H P, Xiao B, Codella N, Liu M C, Dai X Y, Yuan L and Zhang L. 2021. CvT: introducing convolutions to vision Transformers//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 22-31 [DOI: 10.1109/ICCV48922.2021.00009]
Xiao T T, Singh M, Mintun E, Darrell T, Dollár P and Girshick R. 2021. Early convolutions help Transformers see better [EB/OL]. [2022-09-11]. https://arxiv.org/pdf/2106.14881.pdf
Yin X H, Wang Y C and Li D Y. 2021. Survey of medical image segmentation technology based on U-Net structure improvement. Journal of Software, 32(2): 519-550 [DOI: 10.13328/j.cnki.jos.006104]
Zheng G Y, Liu X B and Han G H. 2018. Survey on medical image computer-aided detection and diagnosis systems. Journal of Software, 29(5): 1471-1514 [DOI: 10.13328/j.cnki.jos.005519]
Zheng H, Wang L L, Chen Y C and Li X N. 2022. Cross U-Net: reconstructing cardiac MR image for segmentation//Proceedings of 2022 IEEE International Conference on Multimedia and Expo (ICME). Taipei, China: IEEE: 1-6 [DOI: 10.1109/ICME52920.2022.9859940]
Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, Fu Y W, Feng J F, Xiang T, Torr P H S and Zhang L. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 6877-6886 [DOI: 10.1109/CVPR46437.2021.00681]
Zhou Z W, Siddiquee M M R, Tajbakhsh N and Liang J M. 2018. UNet++: a nested U-Net architecture for medical image segmentation//Proceedings of Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, held in conjunction with MICCAI 2018. Granada, Spain: Springer: 3-11 [DOI: 10.1007/978-3-030-00889-5_1]