Potential and prospects of segment anything model: a survey

Wang Miao; Huang Zhizhong; He Huiguang; Lu Huchuan; Shan Hongming; Zhang Junping

doi:10.11834/jig.230792

Generative Large Model and Human-computer Interaction


Vol. 29, Issue 6, Pages: 1479-1509 (2024)
Published: 16 June 2024

Wang Miao, Huang Zhizhong, He Huiguang, Lu Huchuan, Shan Hongming, Zhang Junping. 2024. Potential and prospects of segment anything model: a survey. Journal of Image and Graphics (中国图象图形学报), 29(06): 1479-1509. DOI: 10.11834/jig.230792.

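Abstract (Chinese version, translated)

With the emergence of foundation models such as contrastive language-image pre-training (CLIP), chat generative pre-trained Transformer (ChatGPT), and generative pre-trained Transformer-4 (GPT-4), research on artificial general intelligence (AGI) has advanced rapidly. AGI aims to give artificial intelligence systems stronger capabilities so that they can learn autonomously, evolve continuously, solve a variety of problems, and handle different tasks, and can therefore be applied widely across many fields. After training on large-scale datasets, these foundation models can handle diverse downstream tasks. Against this background, the segment anything model (SAM) proposed by Meta achieved an important breakthrough in 2023, attaining such strong performance in image segmentation that it has been called the "terminator" of image segmentation. One reason is the segment anything 1 billion (SA-1B) image segmentation dataset, which was collected in three stages by the SAM data engine and contains 11 million images and more than 1 billion masks. The dataset ensures both the quality and the diversity of the masks, which in turn drove the breakthrough in segmentation. Soon after SAM was open-sourced, researchers proposed a series of improved methods and applications. To provide a comprehensive and in-depth understanding of the development, strengths, and weaknesses of SAM, this paper reviews the research progress on SAM. First, the background and core framework of SAM are briefly introduced from several aspects, including foundation models, the data engine, and the dataset. On this basis, current improvements to SAM are reviewed in detail along two key directions: increasing inference speed and improving prediction accuracy. Then, the wide application of SAM in image processing tasks, video-related tasks, and other fields is discussed in depth. This part details the performance of the model on various tasks and data types and highlights its versatility and development potential across fields. Finally, the future development directions and potential application prospects of SAM are analyzed and discussed.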

Abstract

The emergence of large-scale foundation models, such as contrastive language-image pre-training (CLIP), chat generative pre-trained Transformer (ChatGPT), and generative pre-trained Transformer-4 (GPT-4), has driven rapid growth in the field of artificial general intelligence (AGI). AGI aims to give systems the ability to learn autonomously and evolve so that they can solve diverse problems across many domains. Once trained on massive datasets, these foundation models can handle a wide range of downstream tasks. In this context, Meta's segment anything model (SAM) made substantial progress in image segmentation in 2023 and introduced the largest image segmentation dataset to date, SA-1B, which includes over 11 million images and more than one billion masks. One reason for this progress is that SA-1B was collected in three stages through SAM's data engine, an approach that ensures both the quality and the diversity of the masks. This development has profoundly influenced foundation models in the field of computer vision. This study provides a comprehensive understanding of the SAM framework through a detailed review and analysis of relevant research. First, the study covers three aspects of the background and basic framework of SAM. The first aspect is the task of SAM, which includes traditional image segmentation and prompt-guided interactive image segmentation. The second aspect is the model architecture of SAM, which comprises an image encoder, a prompt encoder, and a mask decoder. The third aspect is the data, including the data engine used to collect the dataset and the SA-1B dataset itself.
Building upon this foundation, the study then organizes and analyzes methods for improving SAM from two perspectives. The first is inference speed: faster inference reduces the deployment cost of SAM and makes it easier to run on less powerful devices. The second is prediction accuracy: SAM itself lacks specific semantic information, which leads to suboptimal segmentation results in complex scenarios, so considerable research focuses on making its predictions more accurate. Subsequently, the study reviews and analyzes the current applications of SAM across tasks and data types, divided into three parts. The first part covers image processing tasks, including style transfer, object detection, object counting, image editing, complex image segmentation, and medical image segmentation. Applying SAM directly to medical image segmentation may not yield satisfactory results, which suggests that task-specific adjustments are needed in such scenarios. The second part covers video-related tasks, including video super-resolution, video object tracking, and audio-visual scene segmentation. The third part covers other directions, such as point cloud segmentation, 3D reconstruction, controllable image caption generation, and data annotation. By organizing the applications of SAM into these three parts, the study summarizes the advantages and limitations of applying SAM to various downstream tasks. These analyses can help researchers apply and improve SAM and thereby enhance its robustness and generalization capabilities. Finally, the study proposes several future research directions for SAM.
These directions include: 1) Modularization: although SAM already performs well on certain tasks, its efficiency and flexibility still need to be improved. As the application domains of SAM continue to expand, many applications require SAM to acquire new knowledge, so the model needs domain adaptation and continual learning capabilities. Drawing inspiration from large language models, new modular structures can be added to SAM to provide these capabilities. 2) Weakly supervised semantic segmentation: this task typically requires retraining a classification model and generating pseudo-labels, steps that are time-consuming and intricate. Recent studies use SAM as a base model in this domain and rely on its strong generalization to obtain satisfactory results without fine-tuning. However, although SAM produces relatively clear results in many unambiguous scenarios, it has difficulty generating accurate segmentation masks in semantically ambiguous scenarios because the model does not contain semantic information. To address this problem, more diverse weak labels and additional post-processing modules could be used to improve the segmentation accuracy of SAM and its performance in weakly supervised semantic segmentation. Exploring SAM as a foundation model for weakly supervised semantic segmentation may therefore yield promising results. 3) Multimodal fusion for image segmentation: at present, the prompt input of SAM mainly takes four forms: point, bounding box, segmentation mask, and text prompt. However, the continuous expansion of the application areas of SAM has introduced new requirements for prompt input forms.
SAM currently focuses on 2D visual tasks, and future work could extend it to 3D visual tasks. Possible steps include supporting different input modalities for SAM prompts and introducing time-series prompts to address the limitations of SAM in video processing, which would further improve its performance on downstream video tasks. 4) Efficient fine-tuning of SAM: although SAM has been widely used in various domains, in certain application scenarios its performance still falls short of other state-of-the-art models in the same domain. Studies have shown that fine-tuning SAM on domain-specific datasets improves its performance. However, fine-tuning is costly because SAM is large, so performing it efficiently becomes an important issue. Given the substantial parameter count of SAM, adding new modules to the model, freezing its core during training, and training only the newly added modules significantly reduces the training cost. This approach facilitates further research on applying SAM to various downstream tasks. 5) Leveraging the holistic cognitive perspective of gestalt psychology to enhance the adversarial robustness of SAM: the vulnerability of SAM to attacks may be due to overfitting on local cognition. Introducing holistic cognition may prevent this overfitting and help the model resist attacks that involve noise. By consolidating and summarizing work on SAM, this study aims to support the further development and application of SAM and to drive the advancement of foundation models in the field of computer vision.
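The efficient fine-tuning idea in direction 4) (freeze the core, train only newly added modules) can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions, not SAM's implementation: a single linear layer stands in for the frozen core, a low-rank adapter stands in for the "newly added module", and the class `AdaptedLayer`, its methods, and all initial values are hypothetical names and numbers chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AdaptedLayer:
    """Toy frozen linear layer plus a trainable low-rank adapter (hypothetical)."""

    def __init__(self, dim, rank):
        # Frozen "core" weights: fixed toy values, never updated in train_step.
        self.core = [[0.1 * (i - j) for j in range(dim)] for i in range(dim)]
        # Trainable adapter: down-projection (rank x dim), up-projection (dim x rank).
        self.down = [[0.05 * (j + k + 1) for k in range(dim)] for j in range(rank)]
        self.up = [[0.0] * rank for _ in range(dim)]  # zero init: adapter starts as a no-op

    def forward(self, x):
        hidden = matvec(self.down, x)
        delta = matvec(self.up, hidden)
        out = [c + d for c, d in zip(matvec(self.core, x), delta)]
        return out, hidden

    def train_step(self, x, target, lr=0.05):
        """One gradient step on squared error; returns the loss before the update."""
        out, hidden = self.forward(x)
        err = [o - t for o, t in zip(out, target)]
        loss = sum(e * e for e in err)
        rank = len(self.down)
        # Gradients with respect to the adapter weights only.
        grad_up = [[2 * e * h for h in hidden] for e in err]
        back = [sum(2 * err[i] * self.up[i][j] for i in range(len(err)))
                for j in range(rank)]
        grad_down = [[b * xk for xk in x] for b in back]
        for i, row in enumerate(grad_up):
            for j, g in enumerate(row):
                self.up[i][j] -= lr * g
        for j, row in enumerate(grad_down):
            for k, g in enumerate(row):
                self.down[j][k] -= lr * g
        return loss  # self.core is intentionally left untouched

    def param_counts(self):
        dim, rank = len(self.core), len(self.down)
        return {"frozen": dim * dim, "trainable": 2 * dim * rank}


if __name__ == "__main__":
    layer = AdaptedLayer(dim=8, rank=1)
    print(layer.param_counts())
    for _ in range(3):
        print(layer.train_step([1.0] * 8, [0.0] * 8))
```

In this toy setting the trainable parameter count grows with `2 * dim * rank` while the frozen core grows with `dim * dim`, which is the reason such schemes reduce training cost when the core is large and the rank is small.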

Keywords

artificial general intelligence (AGI); computer vision; image segmentation; visual foundational models; segment anything model (SAM); large language model (LLM)

