面向图像内补与外推问题的迭代预测统一框架
Unified framework with iterative prediction for image inpainting and outpainting
2024年第29卷第2期 页码:491-505
纸质出版日期: 2024-02-16
DOI: 10.11834/jig.230144
郭冬升, 顾肇瑞, 郑冰, 董军宇, 郑海永. 2024. 面向图像内补与外推问题的迭代预测统一框架. 中国图象图形学报, 29(02):0491-0505
Guo Dongsheng, Gu Zhaorui, Zheng Bing, Dong Junyu, Zheng Haiyong. 2024. Unified framework with iterative prediction for image inpainting and outpainting. Journal of Image and Graphics, 29(02):0491-0505
目的
图像内补与外推可看作根据已知区域绘制未知区域的问题,是计算机视觉领域的研究热点。近年来,深度神经网络成为解决内补与外推问题的主流方法。然而,当前解决方法多分别对待内补与外推问题,导致二者难以统一处理;且模型多采用卷积神经网络(convolutional neural network,CNN)构建,受视野局部性限制,较难绘制远距离内容。针对这两个问题,本文按照分而治之思想联合CNN与Transformer构建深度神经网络,提出图像内补与外推统一处理框架及模型。
方法
将内补与外推问题的解决过程分解为“表征、预测、合成”3个部分,表征与合成采用CNN完成,充分利用其局部相关性进行图像到特征映射和特征到图像重建;核心预测由Transformer实现,充分发挥其强大的全局上下文关系建模能力,并提出掩膜自增策略迭代预测特征,降低Transformer同时预测大范围未知区域特征的难度;最后引入对抗学习提升绘制图像逼真度。
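表征阶段所用部分卷积的核心思想,可用如下简化示例说明(仅为示意性质的单通道、单卷积核朴素实现,函数名 partial_conv2d 为本文说明自拟,并非论文模型代码):只对已知像素加权求和,按窗口内有效像素占比重新缩放输出,并同步更新掩膜。

```python
def partial_conv2d(x, mask, weight, k=3):
    """对单通道图像做一次部分卷积(partial convolution)的朴素实现。
    x、mask 为 H×W 嵌套列表(mask 元素 1 表示已知像素),weight 为 k×k 卷积核。
    仅对已知像素加权求和,按有效像素占比缩放,并返回更新后的掩膜。"""
    H, W = len(x), len(x[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    new_mask = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s, valid = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W and mask[ii][jj]:
                        s += weight[di + r][dj + r] * x[ii][jj]
                        valid += 1
            if valid:                         # 窗口内存在已知像素
                out[i][j] = s * (k * k / valid)  # 按有效像素比例重新缩放
                new_mask[i][j] = 1               # 该位置在下一层视为已知
    return out, new_mask
```

可见未知像素(mask=0)的取值完全不影响输出,这正是编码器避免引入未知区域无关信息的方式。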
结果
实验给出在多种数据集下内补与外推对比评测,结果显示本文方法各项性能指标均超越对比方法。通过消融实验发现,模型相比采用非分解方式具有更佳表现,说明分而治之思路功效显著。此外,对掩膜自增策略进行详细的实验分析,表明迭代预测方法可有效提升绘制能力。最后,探究了Transformer关键结构参数对模型性能的影响。
结论
本文提出一种迭代预测统一框架解决图像内补与外推问题,相较对比方法性能更佳,并且各部分设计对性能提升均有贡献,显示了迭代预测统一框架及方法在图像内补与外推问题上的应用价值与潜力。
Objective
Image inpainting and outpainting are significant challenges in computer vision: both involve filling unknown regions of an image based on information available in its known regions. With recent advances, deep learning has become the mainstream approach to these tasks. However, existing solutions frequently treat inpainting and outpainting as separate problems and thus cannot move seamlessly between the two. Furthermore, these methods commonly rely on convolutional neural networks (CNNs), whose locality limits their ability to capture long-range content. To address these issues, this study proposes a unified framework that combines CNN and Transformer models on the basis of a divide-and-conquer strategy, aiming to handle image inpainting and outpainting effectively.
Method
Our approach consists of three stages: representation, prediction, and synthesis. In the representation stage, a CNN encoder maps the input image to a set of meaningful features, leveraging the local information processing capability of CNNs to extract relevant features from the known regions of the image. The encoder incorporates partial convolutions and pixel normalization to limit the influence of irrelevant information from unknown regions. The extracted features are then passed to the prediction stage, where we utilize the Transformer architecture, which excels at modeling global context, to predict features for the unknown regions. Transformers have proven highly effective at capturing long-range dependencies and contextual information in various domains, such as natural language processing, and incorporating one enhances the model's ability to predict accurate and coherent content for inpainting and outpainting tasks. To address the difficulty of predicting features for large unknown regions all at once, we introduce a mask growth strategy that enables iterative feature prediction: the model progressively predicts features for larger regions by gradually expanding the inpainting or outpainting task. This iterative process lets the model refine its predictions and capture more of the related contextual information, leading to improved results. Finally, in the synthesis stage, we reconstruct the complete image by combining the predicted features with the known features from the representation stage, using a CNN decoder composed of multiple convolutional residual blocks with upsampling applied at intervals, which eases model optimization and yields visually appealing, realistic results.
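The mask growth strategy above can be sketched in a few lines. This is a hypothetical simplification for illustration only: the names `iterative_predict` and `neighbor_mean` are our own, the sequence is 1-D rather than a 2-D feature map, and the real model predicts features with a Transformer rather than by averaging neighbors. What it preserves is the key mechanism: each step predicts only the unknown positions bordering the known region, then marks them known, so the mask grows iteratively.

```python
def iterative_predict(feats, known, predict_fn, max_steps=10):
    """Mask-growth iterative prediction sketch on a 1-D feature sequence.
    Each step predicts only unknown positions adjacent to known ones,
    then commits them as known, growing the known region step by step."""
    feats, mask = list(feats), list(known)
    for _ in range(max_steps):
        frontier = [i for i, m in enumerate(mask)
                    if not m and ((i > 0 and mask[i - 1]) or
                                  (i + 1 < len(mask) and mask[i + 1]))]
        if not frontier:
            break  # every reachable position is already known
        preds = {i: predict_fn(feats, mask, i) for i in frontier}
        for i, v in preds.items():  # commit predictions, grow the mask
            feats[i], mask[i] = v, True
    return feats, mask

def neighbor_mean(feats, mask, i):
    """Toy stand-in predictor: average of the known immediate neighbors."""
    vals = [feats[j] for j in (i - 1, i + 1)
            if 0 <= j < len(feats) and mask[j]]
    return sum(vals) / len(vals)
```

Running it on `[1, 0, 0, 0, 5]` with only the two endpoints known fills the gap inward over two steps, so each prediction is conditioned on an already-enlarged known region, mirroring how the strategy reduces the range the predictor must fill at once.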
Result
To evaluate the effectiveness of our method, we conduct comprehensive experiments on diverse datasets covering objects and scenes for both image inpainting and outpainting. We compare our approach with state-of-the-art methods using various evaluation metrics, including the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and perceptual quality metrics. The experimental results show that our unified framework surpasses existing methods across all evaluation metrics. The combination of CNNs and a Transformer allows our model to capture both local details and long-range dependencies, yielding more accurate and visually appealing inpainting and outpainting results. In addition, ablation studies confirm the effectiveness of each component of our method, including the framework structure and the mask growth strategy; all three stages are found to contribute to the performance improvement, highlighting the value of the divide-and-conquer design. Furthermore, we empirically investigate the effect of the numbers of Transformer heads and layers on overall performance, finding that appropriate numbers of iterations, Transformer heads, and Transformer layers can further enhance the framework's performance.
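Of the metrics above, PSNR has a simple closed form worth stating precisely. A minimal reference computation follows, assuming images given as flat lists of pixel values and an 8-bit peak of 255 (the function name is ours, not from the paper's code):

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size images
    given as flat pixel lists; higher means closer to the reference."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

For example, a uniform error of 16 gray levels gives an MSE of 256 and a PSNR of about 24 dB, which helps calibrate the magnitude of the gains reported in the comparisons.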
Conclusion
This study introduces a unified iterative prediction framework for image inpainting and outpainting. Our method outperforms existing approaches, with each aspect of the design contributing to the overall improvement. The combination of CNNs and a Transformer enables the model to capture both local and global context, leading to more accurate and visually coherent inpainting and outpainting results. These findings underscore the practical value and potential of the unified iterative prediction framework for image inpainting and outpainting. Future research directions include applying the framework to other related tasks and further optimizing the model architecture for efficiency and scalability. Moreover, integrating self-supervised learning techniques with large-scale datasets could further improve the robustness and generalization capability of the model on these tasks.
图像内补;图像外推;分而治之;迭代预测;Transformer;卷积神经网络(CNN)
image inpainting; image outpainting; divide-and-conquer; iterative prediction; Transformer; convolutional neural network (CNN)