Real-time high-resolution video portrait matting network combined with background image
2024, Vol. 29, No. 2, Pages 478-490
Print publication date: 2024-02-16
DOI: 10.11834/jig.230174
Peng Hong, Zhang Jiabao, Jia Di, An Tong, Cai Peng, Zhao Jinyuan. 2024. Real-time high-resolution video portrait matting network combined with background image. Journal of Image and Graphics, 29(2): 478-490
Objective
Video matting is one of the most common operations in visual image processing. It aims to separate part of an image from the original image into an independent layer that can then be composited into specific scenes for later video synthesis. In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision. Existing networks cannot meet real-time requirements when processing high-resolution video, and their matting results at the edges of high-resolution targets still suffer from blurring. Several recently proposed methods that use various kinds of auxiliary information to guide alpha matte estimation on high-resolution images have demonstrated good performance, but many of them still fail to learn the edges and fine details of portraits well. Therefore, this study proposes a real-time high-resolution video portrait matting network that leverages a background image.
Method
A two-stage network composed of a base network and a refinement network is presented. To keep the network lightweight, high-resolution frames are first downsampled at a sampling rate D. In the base network, the multi-scale features of each video frame are extracted by an encoder module and fused by a pyramid pooling module; the fused features are fed to the recurrent decoder, which helps it learn the multi-scale characteristics of video frames. In the recurrent decoder, a residual gated recurrent unit (GRU) aggregates temporal information across consecutive video frames and generates the alpha matte, the foreground residual map, and a hidden feature map. The residual structure reduces the number of model parameters and improves the real-time performance of the network, and the residual GRU makes full use of the temporal information of the video to support alpha matte construction over the frame sequence. To improve real-time matting performance on high-resolution images, a high-resolution information guidance module is designed in the refinement network. The original high-resolution video frame and the low-resolution predictions (alpha matte, foreground residual map, and hidden feature map) are passed through this module, which generates high-quality portrait matting results by using high-resolution image information to guide the low-resolution predictions. Inside the module, the combination of covariance mean filtering, variance mean filtering, and pointwise convolution effectively improves matting quality in the detailed regions of portrait contours in high-resolution frames. Under the synergy of the base and refinement networks, the designed network can not only fully extract multi-scale information from low-resolution video frames, but can also learn the edge information of portraits in high-resolution frames more completely. This design leads to more accurate prediction of alpha mattes and foreground images and improves the generalization ability of the matting network across multiple resolutions. In addition, the high-resolution downsampling scheme, the lightweight pyramid pooling module, and the residual link structure further reduce the number of network parameters and improve the real-time performance of the network.
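The residual GRU in the recurrent decoder can be pictured as a convolutional GRU whose output is added back to its input. The following PyTorch sketch illustrates this idea; the class name, the 3 × 3 gate convolutions, and the placement of the residual link are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class ResidualConvGRU(nn.Module):
    """Convolutional GRU cell with a residual link (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # Update (z) and reset (r) gates, computed jointly from [x, h].
        self.gates = nn.Conv2d(channels * 2, channels * 2, kernel_size=3, padding=1)
        # Candidate hidden state, computed from [x, r * h].
        self.candidate = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        h_new = (1 - z) * h + z * h_tilde
        # Residual link: the recurrent path only has to model the change
        # between frames, which keeps the unit small and fast.
        return x + h_new, h_new

# Usage on a feature sequence of shape (T, N, C, H, W):
gru = ResidualConvGRU(16)
h = torch.zeros(1, 16, 64, 64)
for x in torch.randn(8, 1, 16, 64, 64):
    y, h = gru(x, h)  # y feeds the next decoder stage; h carries time info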
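The description of the high-resolution information guidance module, covariance mean filtering plus variance mean filtering plus pointwise convolution, is reminiscent of a learnable guided filter. The sketch below shows one plausible reading of that description, assuming a single-channel guide (e.g., a grayscale frame); the module name, the box-filter radius, and the 1 × 1 refinement layer are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

def box_filter(x: torch.Tensor, r: int) -> torch.Tensor:
    # Local mean over a (2r + 1) x (2r + 1) window.
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r,
                        count_include_pad=False)

class HRGuidance(nn.Module):
    """Guided-filter-style upsampling of a low-resolution alpha (sketch)."""

    def __init__(self, radius: int = 1, eps: float = 1e-4):
        super().__init__()
        self.r, self.eps = radius, eps
        self.refine = nn.Conv2d(2, 2, kernel_size=1)  # pointwise refinement

    def forward(self, guide_lr, alpha_lr, guide_hr):
        # Variance / covariance statistics via mean (box) filtering.
        mean_g = box_filter(guide_lr, self.r)
        mean_a = box_filter(alpha_lr, self.r)
        var_g = box_filter(guide_lr * guide_lr, self.r) - mean_g ** 2
        cov_ga = box_filter(guide_lr * alpha_lr, self.r) - mean_g * mean_a
        A = cov_ga / (var_g + self.eps)  # local linear slope
        b = mean_a - A * mean_g          # local linear offset
        # Pointwise convolution lets the network refine the coefficients.
        A, b = self.refine(torch.cat([A, b], dim=1)).chunk(2, dim=1)
        # Upsample the coefficients and apply them at full resolution.
        size = guide_hr.shape[-2:]
        A = F.interpolate(A, size=size, mode="bilinear", align_corners=False)
        b = F.interpolate(b, size=size, mode="bilinear", align_corners=False)
        return A * guide_hr + b  # high-resolution alpha estimate

The design choice here follows the classical guided-filter identity alpha_hr ≈ A * g + b: all heavy statistics run at low resolution, and only the cheap linear map is applied at full resolution, which is what makes high-resolution guidance compatible with real-time operation.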
Result
We implement our network in PyTorch on an NVIDIA GTX 1080Ti GPU with 11 GB of RAM, using a batch size of 1 and the Adam optimizer. The network is trained on three datasets in sequence: the base network is first trained on the Video240K SD dataset with input sequences of 15 frames for 8 epochs; the refinement network is then trained on the Video240K HD dataset for 1 epoch; finally, to improve robustness on high-resolution video, the refinement network is further trained on the Human2K dataset with a downsampling rate D of 0.25 and input sequences of 2 frames for 50 epochs. Compared with related network models from recent years, the experimental results show that the proposed method outperforms other methods on both the Video240K SD dataset and the Human2K dataset. On Video240K SD, the evaluation metrics, namely, the sum of absolute differences (SAD), mean squared error (MSE), gradient error (Grad), and connectivity error (Conn), improve by 26.1%, 50.6%, 56.9%, and 39.5%, respectively. In particular, on the high-resolution Human2K dataset, the proposed method is significantly superior to other state-of-the-art methods, improving SAD, MSE, Grad, and Conn by 18.8%, 39.2%, 40.7%, and 20.9%, respectively, while achieving the lowest network complexity at 4K resolution (28.78 GMac). The running speed reaches 49 frames/s on low-resolution video (512 × 288 pixels) and 42.4 frames/s on medium-resolution video (1 024 × 576 pixels). In particular, on the NVIDIA GTX 1080Ti GPU, the network runs at 26 frames/s on 4K video and 43 frames/s on HD video, a marked improvement over other state-of-the-art methods.
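For reference, the following minimal functions compute two of the reported metrics under their standard matting definitions; SAD is conventionally reported scaled by 1e-3, and Grad and Conn require gradient-magnitude and connectivity computations that are omitted here.

import numpy as np

def sad(alpha_pred: np.ndarray, alpha_gt: np.ndarray) -> float:
    # Sum of absolute differences; conventionally reported x 1e-3.
    return float(np.abs(alpha_pred - alpha_gt).sum() * 1e-3)

def mse(alpha_pred: np.ndarray, alpha_gt: np.ndarray) -> float:
    # Mean squared error over all pixels, with alpha values in [0, 1].
    return float(np.square(alpha_pred - alpha_gt).mean())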
Conclusion
The network model proposed in this study can better complete real-time matting of high-resolution portraits. The pyramid pooling module in the base network effectively extracts and integrates the multi-scale information of video frames, while the residual GRU module effectively aggregates temporal information across consecutive frames. The high-resolution information guidance module captures high-resolution image information and guides the low-resolution branch to learn it, which markedly enhances matting quality along portrait edges in high-resolution frames. Experiments on the high-resolution Human2K dataset show that the proposed network predicts high-resolution alpha mattes more accurately. It achieves high real-time processing speed and can provide better support for advanced applications such as film and television production, short-video social networking, and online conferencing.
real-time human figure matting; neural network; multiscale features; time information; high resolution