Image super-resolution reconstruction using transposed self-attention with local feature enhancement
2024, Vol. 29, No. 4, pp. 908-921
Print publication date: 2024-04-16
DOI: 10.11834/jig.230320
Sun Yang, Ding Jianwei, Zhang Qi, Deng Qiyao. 2024. Images super-resolution reconstruction of transposed self-attention with local feature enhancement. Journal of Image and Graphics, 29(04):0908-0921
Objective
Super-resolution (SR) reconstruction methods that apply self-attention within partitioned windows for feature extraction have achieved remarkable results. However, window partitioning limits the range over which image information can be aggregated and constrains the model's ability to model feature information. To address this problem, this paper builds a global and local information modeling network on a transposed self-attention mechanism to capture the dependencies between image pixels.
Method
A lightweight baseline block first performs simple relational modeling of the features. The self-attention mechanism is then moved from the spatial dimension to the channel dimension, and long-range dependencies between pixels are established by computing a cross-covariance matrix. A channel attention block is then introduced to supplement the local information needed for image reconstruction. Finally, a dual gating mechanism controls the flow of information through the model, improving its feature modeling capability and robustness. A sketch of the channel-dimension attention is given below.
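The following PyTorch sketch illustrates the transposed (channel-dimension) self-attention described above. It is an illustrative reconstruction from this description, not the authors' released code: the multi-head split, the depth-wise convolution on the query/key/value projections, and the learnable temperature follow common transposed-attention designs (e.g., Restormer) and are assumptions here; the ReLU in place of softmax follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedSelfAttention(nn.Module):
    """Self-attention over the channel dimension (cross-covariance attention).

    The attention map is a (C x C) cross-covariance matrix instead of a
    (HW x HW) matrix, so the cost grows linearly with the number of pixels.
    ReLU replaces softmax so only positively correlated channels contribute.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.dwconv = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.dwconv(self.qkv(x)).chunk(3, dim=1)

        # reshape to (batch, heads, channels_per_head, pixels)
        def to_heads(t):
            return t.reshape(b, self.num_heads, c // self.num_heads, h * w)

        q, k, v = map(to_heads, (q, k, v))
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)

        # (C/heads x C/heads) cross-covariance matrix between channels
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = F.relu(attn)          # keep positively correlated channels only
        out = attn @ v               # aggregate along the pixel axis
        out = out.reshape(b, c, h, w)
        return self.project_out(out)
```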
Result
Experiments on five benchmark datasets (Set5, Set14, BSD100, Urban100, and Manga109) compare the proposed method with mainstream methods; it achieves the best or second-best results on SR tasks at different scale factors. Compared with SwinIR (image restoration using Swin Transformer) on the ×2 SR task, the peak signal-to-noise ratio on these five datasets improves by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB, respectively; the structural similarity also improves substantially, and the gain in visual quality is evident.
Conclusion
The proposed network models the global relationships of feature information more thoroughly while preserving the local correlations inherent to images. The reconstructed images are of clearly higher quality with richer details, which demonstrates the effectiveness and superiority of the proposed method.
Objective
Research on super-resolution image reconstruction based on deep learning has made exceptional progress in recent years. In particular, when the development of traditional convolutional neural networks reached a bottleneck, the Transformer, which performs extremely well in natural language processing, was introduced to super-resolution image reconstruction. However, the computational complexity of the Transformer grows quadratically with the number of pixels (the product of the width and height of the input image), which prevents the Transformer from being transferred wholesale to low-level computer vision tasks. Recent methods, such as image restoration using Swin Transformer (SwinIR), have achieved excellent performance by dividing the image into windows, performing self-attention within each window, and exchanging information between windows. However, the computational burden of this window-based scheme increases as the window size increases. Moreover, window partitioning cannot completely model the global information of an image, so part of the information is lost. To solve these problems, we model the long-range dependencies of images by constructing a Transformer block while keeping the number of parameters at a moderate level. Excellent super-resolution reconstruction performance is achieved by constructing global dependencies of features.
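To make the complexity argument concrete, the per-layer attention costs for an H × W feature map with C channels and window size M are contrasted below. The first two counts follow the standard Swin Transformer analysis; the third is the usual estimate for channel-wise (transposed) attention, and the notation is ours.

```latex
% Per-layer attention cost (constants omitted in the last line):
\begin{aligned}
\Omega(\text{MSA})            &= 4HWC^{2} + 2(HW)^{2}C   && \text{global spatial attention, quadratic in } HW\\
\Omega(\text{W-MSA})          &= 4HWC^{2} + 2M^{2}HWC    && \text{window attention, grows with the window size } M\\
\Omega(\text{transposed MSA}) &= \mathcal{O}\!\left(HW\,C^{2}\right) && \text{channel attention, linear in } HW
\end{aligned}
```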
Method
The proposed super-resolution network based on transposed self-attention (SRTSA) consists of four main stages: a shallow feature extraction module, a deep feature extraction module, an image upsampling module, and an image reconstruction module. The shallow feature extraction part consists of a 3 × 3 convolution. The deep feature extraction part mainly consists of global and local information extraction blocks (GLIEBs). The proposed GLIEB first performs simple relational modeling through a sufficiently lightweight nonlinear activation free block (NAFBlock). Although dropout can improve the robustness of a model, we discard the dropout layer so that no information is lost before the feature information is modeled globally. When modeling feature information globally with the transposed self-attention mechanism, we keep the features that contribute positively to image reconstruction and discard those with negative effects by replacing the softmax activation function in the self-attention mechanism with the ReLU activation function, which makes the reconstructed global dependencies more robust. Given that an image contains both global and local information, a residual channel attention module is used to supplement the local information and enhance the expressive ability of the model. Furthermore, a dual gating mechanism is introduced to control the flow of information through the model, improving its feature modeling capability and robustness. The image upsampling module uses sub-pixel convolution to expand the features to the target dimension, and the reconstruction module employs a 3 × 3 convolution to obtain the final reconstruction result. Although many loss functions have been proposed to optimize model training, we use the same L1 loss function as SwinIR to supervise training so that the comparison reflects the model itself; the L1 loss provides a stable gradient that allows the model to converge quickly. During training, 800 images from the DIV2K dataset are used; they are randomly rotated or horizontally flipped to augment the data, and 16 low-resolution (LR) image patches of size 48 × 48 pixels are used as input in each iteration. The Adam optimizer is used for training.
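The sketch below puts the four stages together and shows one training step with the stated settings (48 × 48 LR patches, batch size 16, L1 loss, Adam). It is a simplified layout, not the released SRTSA code: the GLIEB internals are abbreviated to a residual placeholder, and the channel width, block count, learning rate, and long residual connection are assumptions.

```python
import torch
import torch.nn as nn

class PlaceholderGLIEB(nn.Module):
    """Stand-in for the paper's global/local information extraction block."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # simple residual placeholder

class SRNet(nn.Module):
    """Shallow conv -> deep feature blocks -> sub-pixel upsampling -> 3x3 reconstruction."""
    def __init__(self, channels: int = 64, num_blocks: int = 8, scale: int = 2):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)           # shallow feature extraction
        self.deep = nn.Sequential(*[PlaceholderGLIEB(channels) for _ in range(num_blocks)])
        self.upsample = nn.Sequential(                                 # sub-pixel convolution
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)       # reconstruction module

    def forward(self, lr):
        shallow = self.shallow(lr)
        deep = self.deep(shallow) + shallow                            # long residual connection (assumed)
        return self.reconstruct(self.upsample(deep))

# One training step: 16 LR patches of 48 x 48 pixels, L1 loss, Adam optimizer
# (the learning rate is an assumption).
model = SRNet(scale=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
lr_patch = torch.rand(16, 3, 48, 48)
hr_patch = torch.rand(16, 3, 96, 96)
loss = nn.functional.l1_loss(model(lr_patch), hr_patch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```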
Result
We test on five datasets commonly used in super-resolution tasks, namely, Set5, Set14, the Berkeley segmentation dataset 100 (BSD100), Urban100, and Manga109, to demonstrate the effectiveness and robustness of the proposed method. We also compare the proposed method with the SRCNN, VDSR, EDSR, RCAN, SAN, HAN, NLSA, and SwinIR networks in terms of objective metrics; all of these networks are supervised with the L1 loss function during training. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are computed on the Y channel of the YCbCr space of the output image to measure the reconstruction quality. Experimental results show that our method obtains the best PSNR and SSIM values. On the ×2 super-resolution task, compared with SwinIR, the PSNR of the proposed method is improved by 0.03 dB, 0.21 dB, 0.05 dB, 0.29 dB, and 0.10 dB, and the SSIM is improved by 0.000 4, 0.001 6, 0.000 9, and 0.002 7 on the four datasets other than Manga109. The reconstructed results show that SRTSA recovers more detail and texture structure than most methods. Attribution analysis with local attribution maps (LAM) shows that SRTSA draws on a wider range of pixels during reconstruction than other methods such as SwinIR, which fully illustrates its global modeling capability.
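The evaluation protocol mentioned above (PSNR measured on the Y channel of YCbCr) can be reproduced with a short sketch like the one below. The ITU-R BT.601 luma coefficients are the ones conventionally used in SR benchmarks; cropping a border equal to the scale factor is a common convention and is an assumption here.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Convert an RGB image in [0, 255] to the Y channel of YCbCr (BT.601)."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, crop: int = 2) -> float:
    """PSNR between an SR output and the ground-truth HR image on the Y channel.

    `crop` pixels are removed from each border (typically the scale factor).
    """
    sr_y = rgb_to_y(sr.astype(np.float64))[crop:-crop, crop:-crop]
    hr_y = rgb_to_y(hr.astype(np.float64))[crop:-crop, crop:-crop]
    mse = np.mean((sr_y - hr_y) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```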
Conclusion
The proposed super-resolution image reconstruction algorithm based on a transposed self-attention mechanism converts global relationship modeling in the spatial dimension into global relationship modeling in the channel dimension, so the global relationships of feature information are fully modeled without losing the local relationships of the features. The reconstruction thus draws on both global and local information, which effectively improves image super-resolution performance. The excellent PSNR and SSIM on the five datasets and the markedly higher quality of the reconstructed images, with rich details and sharp edges, fully demonstrate the effectiveness and superiority of the proposed method.
image super-resolution reconstruction; self-attention; deep learning; image restoration; gated networks
Agustsson E and Timofte R. 2017. NTIRE 2017 challenge on single image super-resolution: dataset and study//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1122-1131 [DOI: 10.1109/CVPRW.2017.150]
Bevilacqua M, Roumy A, Guillemot C and Morel M L A. 2012. Low-complexity single-image super-resolution based on nonnegative neighbor embedding//Proceedings of the 23rd British Machine Vision Conference. Surrey, UK: BMVA Press: 1-12 [DOI: 10.5244/C.26.135]
Chen L Y, Chu X J, Zhang X Y and Sun J. 2022. Simple baselines for image restoration//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 17-33 [DOI: 10.1007/978-3-031-20071-7_2]
Chu X J, Chen L Y, Chen C P and Lu X. 2022. Improving image restoration by revisiting global information aggregation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 53-71 [DOI: 10.1007/978-3-031-20071-7_4]
Dai T, Cai J R, Zhang Y B, Xia S T and Zhang L. 2019. Second-order attention network for single image super-resolution//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 11057-11066 [DOI: 10.1109/CVPR.2019.01132]
Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-05-10]. https://arxiv.org/pdf/2010.11929.pdf
Gu J J and Dong C. 2021. Interpreting super-resolution networks with local attribution maps//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 9195-9204 [DOI: 10.1109/CVPR46437.2021.00908]
Huang J B, Singh A and Ahuja N. 2015. Single image super-resolution from transformed self-exemplars//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 5197-5206 [DOI: 10.1109/CVPR.2015.7299156]
Kim J, Lee J K and Lee K M. 2016. Accurate image super-resolution using very deep convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1646-1654 [DOI: 10.1109/CVPR.2016.182]
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z H and Shi W Z. 2017. Photo-realistic single image super-resolution using a generative adversarial network//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 105-114 [DOI: 10.1109/CVPR.2017.19]
Li X, Pan J S, Tang J H and Dong J X. 2023. DLGSANet: lightweight dynamic local and global self-attention networks for image super-resolution [EB/OL]. [2023-05-10]. https://arxiv.org/pdf/2301.02031.pdf
Liang J Y, Cao J Z, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. SwinIR: image restoration using Swin Transformer//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 1833-1844 [DOI: 10.1109/ICCVW54120.2021.00210]
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Liu H C, Ren W Q, Wang R and Cao X C. 2022. A super-resolution Transformer fusion network for single blurred image. Journal of Image and Graphics, 27(5): 1616-1631 [DOI: 10.11834/jig.210847]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Martin D, Fowlkes C, Tal D and Malik J. 2002. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, Canada: IEEE: 416-423 [DOI: 10.1109/ICCV.2001.937655]
Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T and Aizawa K. 2017. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 76(20): 21811-21838 [DOI: 10.1007/s11042-016-4020-z]
Mei Y Q, Fan Y C and Zhou Y Q. 2021. Image super-resolution with non-local sparse attention//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 3516-3525 [DOI: 10.1109/CVPR46437.2021.00352]
Niu B, Wen W L, Ren W Q, Zhang X D, Yang L P, Wang S Z, Zhang K H, Cao X C and Shen H F. 2020. Single image super-resolution via a holistic attention network//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 191-207 [DOI: 10.1007/978-3-030-58610-2_12]
Qiu D F, Jiang J J, Hu X Y, Liu X M and Ma J Y. 2023. Guided Transformer for high-resolution visible image guided infrared image super-resolution. Journal of Image and Graphics, 28(1): 196-206 [DOI: 10.11834/jig.220604]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Shi W Z, Caballero J, Huszár F, Totz J, Aitken A P, Bishop R, Rueckert D and Wang Z H. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1874-1883 [DOI: 10.1109/CVPR.2016.207]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang M H, Ke F H, Liang Y, Fan Z and Liao L. 2022. 3D attention and Transformer based single image deraining network. Journal of Image and Graphics, 27(5): 1509-1521 [DOI: 10.11834/jig.210794]
Wang Z D, Cun X D, Bao J M, Zhou W G, Liu G Z and Li H Q. 2022. Uformer: a general U-shaped Transformer for image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 17662-17672 [DOI: 10.1109/CVPR52688.2022.01716]
Zamir S W, Arora A, Khan S, Hayat M, Khan F S and Yang M H. 2022. Restormer: efficient Transformer for high-resolution image restoration//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 5718-5729 [DOI: 10.1109/CVPR52688.2022.00564]
Zeyde R, Elad M and Protter M. 2012. On single image scale-up using sparse-representations//Proceedings of the 7th International Conference on Curves and Surfaces. Avignon, France: Springer: 711-730 [DOI: 10.1007/978-3-642-27413-8_47]
Zhang K, Zuo W M, Chen Y J, Meng D Y and Zhang L. 2017. Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7): 3142-3155 [DOI: 10.1109/TIP.2017.2662206]
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018a. Image super-resolution using very deep residual channel attention networks//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]
Zhang Y L, Tian Y P, Kong Y, Zhong B E and Fu Y. 2018b. Residual dense network for image super-resolution//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2472-2481 [DOI: 10.1109/CVPR.2018.00262]