深度学习实时语义分割综述
Deep learning-based real-time semantic segmentation: a survey
2024, Vol. 29, No. 5, pp. 1119-1145
Print publication date: 2024-05-16
DOI: 10.11834/jig.230659
高常鑫, 徐正泽, 吴东岳, 余昌黔, 桑农. 2024. 深度学习实时语义分割综述. 中国图象图形学报, 29(05):1119-1145
Gao Changxin, Xu Zhengze, Wu Dongyue, Yu Changqian, Sang Nong. 2024. Deep learning-based real-time semantic segmentation: a survey. Journal of Image and Graphics, 29(05):1119-1145
语义分割是计算机视觉领域的一项像素级别的感知任务,目的是为图像中的每个像素分配相应类别标签,具有广泛应用。许多语义分割网络结构复杂,计算量和参数量较大,在对高分辨率图像进行像素层次的理解时具有较大的延迟,这极大限制了其在资源受限环境下的应用,如自动驾驶、辅助医疗和移动设备等。因此,实时推理的语义分割网络得到了广泛关注。本文对深度学习中实时语义分割算法进行了全面论述和分析。1)介绍了语义分割和实时语义分割任务的基本概念、应用场景和面临问题;2)详细介绍了实时语义分割算法中常用的技术和设计,包括模型压缩技术、高效卷积神经网络(convolutional neural network,CNN)模块和高效Transformer模块;3)全面整理和归纳了现阶段的实时语义分割算法,包括单分支网络、双分支网络、多分支网络、U型网络和神经架构搜索网络5种类别的实时语义分割方法,涵盖基于CNN、基于Transformer和基于混合框架的分割网络,并分析了各类实时语义分割算法的特点和局限性;4)提供了完整的实时语义分割评价体系,包括相关数据集和评价指标、现有方法性能汇总以及领域主流方法的同设备比较,为后续研究者提供统一的比较标准;5)给出结论并分析了实时语义分割领域仍存在的挑战,对实时语义分割领域未来可能的研究方向提出了相应见解。本文提及的算法、数据集和评估指标已汇总至
https://github.com/xzz777/Awesome-Real-time-Semantic-Segmentation
,以便后续研究者使用。
Semantic segmentation is a fundamental task in computer vision that aims to assign a category label to each pixel in the input image. Many semantic segmentation networks have complex structures, high computational costs, and massive numbers of parameters. As a result, they introduce considerable latency when performing pixel-level scene understanding on high-resolution images, which greatly restricts their applicability in resource-constrained scenarios such as autonomous driving, medical applications, and mobile devices. Therefore, real-time semantic segmentation methods, which produce high-precision segmentation masks at fast inference speeds, have received widespread attention. This study provides a systematic and critical review of deep-learning-based real-time semantic segmentation algorithms and traces the development of the field in recent years. It covers three key aspects of real-time semantic segmentation: real-time semantic segmentation networks, mainstream datasets, and common evaluation indicators. In addition, this study quantitatively evaluates the real-time semantic segmentation methods discussed and provides insights into the future development of this field. First, the semantic segmentation and real-time semantic segmentation tasks, their application scenarios, and their challenges are introduced. The key challenge in real-time semantic segmentation lies in extracting high-quality semantic information with high efficiency. Second, preliminary knowledge for studying real-time semantic segmentation algorithms is introduced in detail. Specifically, this study covers four general model compression methods: network pruning, neural architecture search, knowledge distillation, and parameter quantization. It also introduces efficient CNN modules popular in real-time semantic segmentation networks, such as those of MobileNet, ShuffleNet, and EfficientNet, as well as efficient Transformer modules, such as external attention, SeaFormer, and MobileViT. Then, existing real-time semantic segmentation algorithms are organized and summarized. According to the characteristics of the overall network structure, existing works are grouped into five categories: single-branch, two-branch, multibranch, U-shaped, and neural architecture search networks. Specifically, the encoder of a single-branch network is a single-branch hierarchical backbone, and its decoder is usually lightweight and does not involve complex fusion of multiscale features. A two-branch network adopts a two-branch encoder, using one branch to capture spatial detail and the other to model semantic context. Multibranch networks feature a multibranch encoder or multiresolution inputs, where the input at each resolution passes through a different subnetwork. A U-shaped network has a contracting encoder and an expansive decoder that is roughly symmetric to the encoder. Most networks in the aforementioned four categories are manually designed, whereas neural architecture search networks are obtained by applying neural architecture search techniques to the four types of architectures. These five categories cover almost all deep-learning-based real-time semantic segmentation algorithms, including CNN-based, Transformer-based, and hybrid-architecture segmentation networks. Moreover, commonly used datasets and evaluation indicators of accuracy, speed, and model size for real-time segmentation are introduced. Popular datasets are divided into autonomous driving scene and general scene datasets, and the evaluation indicators are divided into accuracy indicators and efficiency descriptors.
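To make concrete why MobileNet-style modules are considered efficient, the following sketch (an illustrative calculation, not code from any surveyed network) compares the parameter count of a standard convolution with that of a depthwise separable convolution, the building block that MobileNet popularized:

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def dw_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k convolution (one filter per input channel)
    followed by a pointwise 1 x 1 convolution (bias omitted)."""
    return c_in * k * k + c_in * c_out

# Example: a 3 x 3 layer with 128 input and 128 output channels
std = conv_params(128, 128, 3)          # 147456
sep = dw_separable_params(128, 128, 3)  # 1152 + 16384 = 17536
print(std, sep, round(std / sep, 1))    # roughly 8.4x fewer parameters
```

The same ratio, approximately k² for large channel counts, also applies to multiply-accumulate operations, which is why such factorized convolutions dominate lightweight encoder designs.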
In addition, this study quantitatively evaluates the real-time semantic segmentation algorithms mentioned above on multiple datasets using the relevant evaluation indicators. To avoid interference from differing devices in quantitative comparisons between real-time semantic segmentation algorithms, this study benchmarks the advanced methods of each category on the same device and configuration, establishing a fair and complete real-time semantic segmentation evaluation system and thereby contributing a unified standard of comparison for subsequent research. Finally, current challenges in real-time semantic segmentation are discussed, and possible future directions are envisioned (e.g., utilization of Transformers, applications on edge devices, knowledge transfer from visual foundation models, diversity of evaluation indicators, variety of datasets, utilization of multimodal data and weakly supervised methods, and combination with incremental learning). The algorithms, datasets, and evaluation indicators mentioned in this paper are summarized at
https://github.com/xzz777/Awesome-Real-time-Semantic-Segmentation
for the convenience of subsequent researchers.
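The accuracy indicator most widely reported by the surveyed methods, mean intersection over union (mIoU), is typically computed from a class confusion matrix. The following minimal NumPy sketch (illustrative only, with toy label maps rather than data from any benchmark) shows one common way to implement it:

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a num_classes x num_classes confusion matrix.

    Rows index ground-truth classes, columns index predicted classes;
    pixels whose label falls outside [0, num_classes) (e.g., an ignore
    label) are excluded from the count.
    """
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf: np.ndarray) -> float:
    """mIoU: per-class |pred ∩ gt| / |pred ∪ gt|, averaged over classes."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return float(np.mean(inter / np.maximum(union, 1)))

# Toy 2 x 2 label maps with two classes
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
conf = confusion_matrix(pred, gt, num_classes=2)
print(mean_iou(conf))  # (1/2 + 2/3) / 2 = 0.5833...
```

The efficiency side, by contrast, is measured as frames per second (frames processed divided by elapsed wall-clock time) on a fixed device, which is precisely why the same-device comparison described above is needed for fair benchmarking.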
实时语义分割;模型轻量化;高效模块设计;计算机视觉;深度学习
real-time semantic segmentation; lightweight model design; efficient module design; computer vision; deep learning