Cross-scale Transformer image super-resolution reconstruction with fusion channel attention
2025, Vol. 30, Issue 3, pp. 784-797
Received: 2024-05-28; Revised: 2024-08-28; Published in print: 2025-03-16
DOI: 10.11834/jig.240279
Objective
To address the problems of a single feature-extraction pattern, loss of high-frequency detail, and structural distortion in reconstructed images when Transformer models are applied to super-resolution tasks, a cross-scale Transformer image super-resolution reconstruction model with fusion channel attention is proposed.
Method
The model consists of four modules: shallow feature extraction, cross-scale deep feature extraction, multilevel feature fusion, and high-quality reconstruction. Shallow feature extraction applies convolution to the early image to obtain a more stable output. Cross-scale deep feature extraction uses a cross-scale Transformer together with an enhanced channel attention mechanism to enlarge the receptive field and extract features at different scales through weighted filtering for subsequent fusion. The multilevel feature fusion module uses the enhanced channel attention mechanism to dynamically adjust the channel weights of features at different scales, encouraging the model to learn rich contextual information and strengthening its capability in image super-resolution reconstruction; a sketch of this channel reweighting idea follows below.
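The abstract does not give the exact formulation of the enhanced channel attention, so the following PyTorch sketch shows a standard squeeze-and-excitation-style channel attention block of the kind the description suggests; the class name `ChannelAttention`, the reduction ratio, and the global-pooling choice are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention (illustrative sketch).

    Global average pooling summarizes each channel, a small bottleneck MLP
    produces per-channel weights in (0, 1), and the input is rescaled so that
    informative channels are amplified and redundant ones suppressed.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) -> (B, C, 1, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(self.pool(x))  # weighted filtering of channels
```

Applied after features at different scales are stacked, such a block realizes the "dynamic adjustment of channel weights" that the abstract describes.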
Result
Evaluations on the Set5, Set14, BSD100 (Berkeley segmentation dataset 100), Urban100 (urban scene 100), and Manga109 benchmark datasets show that, compared with the SwinIR super-resolution model, the proposed model improves peak signal-to-noise ratio by 0.06-0.25 dB and produces reconstructed images with better visual quality.
Conclusion
The proposed cross-scale Transformer image super-resolution reconstruction model with fusion channel attention combines convolutional features with Transformer features and uses the enhanced channel attention mechanism to suppress noise and redundant information in the image, reducing the likelihood of blurring and distortion and effectively improving super-resolution performance. Test results on several public experimental datasets verify the effectiveness of the proposed model.
Objective
Image super-resolution reconstruction refers to converting low-resolution (LR) images into high-resolution (HR) images of the same scene. In recent years, the technique has been widely used in computer vision, image processing, and other fields because of its practical value and theoretical importance. Although models based on convolutional neural networks have made remarkable progress, most super-resolution networks pursue better reconstruction within a single-level, end-to-end structure. This design often overlooks multilevel feature information during reconstruction and thus limits model performance. With the advance of deep learning, Transformer-based architectures have been introduced into computer vision with substantial success, and researchers have applied Transformer models to low-level vision tasks, including image super-resolution reconstruction. In this context, however, the Transformer model suffers from a single feature-extraction pattern, loss of high-frequency details in the reconstructed image, and structural distortion. A cross-scale Transformer image super-resolution reconstruction model with fusion channel attention is proposed to address these problems.
Method
The model comprises four modules: shallow feature extraction, cross-scale deep feature extraction, multilevel feature fusion, and high-quality reconstruction. Shallow feature extraction uses convolution to process the early image and obtain a highly stable output, because convolutional layers provide stable optimization and extraction results in early visual processing. The cross-scale deep feature extraction module uses the cross-scale Transformer and the enhanced channel attention mechanism to acquire features at different scales. The core of the cross-scale Transformer lies in the cross-scale self-attention mechanism and the gated convolutional feedforward network: the former downsamples the feature maps to different scales by scale factors and learns contextual information by exploiting image self-similarity, while the latter encodes the positions of spatially neighboring pixels to help learn local image structure, replacing the feedforward network of the traditional Transformer. An enhanced channel attention mechanism is applied after the cross-scale Transformer to enlarge the receptive field and, via weighted filtering, extract features at different scales that replace the original features passed on to subsequent layers. Because increasing the network depth eventually saturates performance, the number of residual cross-scale Transformer blocks is set to 3 to balance model complexity against super-resolution reconstruction performance. After features at different scales are stacked in the multilevel feature fusion module, the enhanced channel attention mechanism dynamically adjusts their channel weights and learns rich contextual information, thereby enhancing the reconstruction capability of the network. In the high-quality reconstruction module, convolutional layers and pixel shuffle are used to upsample the features to the dimensions of the corresponding high-resolution image. In the training phase, the model is trained with 900 HR images from the DIV2K dataset, and the corresponding LR images are generated from the HR images by bicubic downsampling (with factors of ×2, ×3, and ×4). The network is optimized with the Adam algorithm using the $L_1$ loss as the loss function.
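One way to read the cross-scale self-attention described above is as attention in which queries stay at full resolution while keys and values come from a copy of the feature map downsampled by a scale factor, so that each pixel can match self-similar structure at a coarser scale. The single-head PyTorch sketch below illustrates that reading; the scale factor, projection layers, and residual wiring are assumptions rather than the paper's exact design, and the gated convolutional feedforward network is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    """Single-head cross-scale self-attention (illustrative sketch).

    Queries are taken at full resolution; keys and values are computed on a
    bicubically downsampled copy of the features, letting each pixel attend
    to self-similar patterns at a coarser scale.
    """

    def __init__(self, dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.q = nn.Conv2d(dim, dim, kernel_size=1)
        self.kv = nn.Conv2d(dim, 2 * dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)              # (B, HW, C)
        x_down = F.interpolate(x, scale_factor=1 / self.scale,
                               mode="bicubic", align_corners=False)
        k, v = self.kv(x_down).chunk(2, dim=1)                # (B, C, h', w') each
        k = k.flatten(2).transpose(1, 2)                      # (B, h'w', C)
        v = v.flatten(2).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                        # residual connection
```

Because the keys and values live on the coarser grid, the attention map has size HW × h'w', which also reduces cost relative to full-resolution self-attention.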
Result
Tests are performed on five standard datasets, namely, Set5, Set14, BSD100, Urban100, and Manga109, and the performance of the proposed model is compared with 10 state-of-the-art models: enhanced deep residual networks for single image super-resolution (EDSR), residual channel attention networks (RCAN), second-order attention network (SAN), cross-scale non-local attention (CSNLA), the cross-scale internal graph neural network (IGNN), holistic attention network (HAN), non-local sparse attention (NLSA), image restoration using Swin Transformer (SwinIR), efficient long-range attention network (ELAN), and permuted self-attention (SRFormer). Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as performance metrics. Because human vision is very sensitive to image brightness, both metrics are measured on the Y channel. Experimental results show that the proposed model obtains higher PSNR and SSIM values and recovers more detail and more accurate textures at magnification factors of ×2, ×3, and ×4. The proposed method improves PSNR by 0.13-0.25 dB over SwinIR and by 0.07-0.21 dB over ELAN on the Urban100 dataset, and by 0.07-0.21 dB over SwinIR and by 0.06-0.19 dB over ELAN on the Manga109 dataset. The local attribution map (LAM) is used to further explore model behavior. The results reveal that the proposed model draws on a wider range of pixel information and exhibits a higher diffusion index (DI) than SwinIR, confirming the effectiveness of the proposed model from an interpretability viewpoint.
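Measuring PSNR on the Y channel means converting both images from RGB to the BT.601 luma component and then computing the usual peak-to-MSE log ratio. The following is a minimal NumPy sketch; the border cropping and function names are illustrative conventions, not taken from the paper.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luma (Y) channel of an RGB image with values in [0, 255] (BT.601)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, border: int = 4) -> float:
    """PSNR between two RGB images of equal shape, measured on the Y channel.

    A border (often equal to the upscaling factor) is cropped before
    comparison, since reconstruction near image edges is less reliable.
    """
    y_sr = rgb_to_y(sr.astype(np.float64))
    y_hr = rgb_to_y(hr.astype(np.float64))
    if border > 0:
        y_sr = y_sr[border:-border, border:-border]
        y_hr = y_hr[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```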
Conclusion
The proposed cross-scale Transformer image super-resolution reconstruction model with fusion channel attention reduces noise and redundant information in the image by fusing convolutional features with Transformer features, and its enhanced channel attention mechanism lowers the likelihood of blurring and distortion, effectively improving image super-resolution performance. Test results on numerous public experimental datasets verify the effectiveness of the proposed model, which produces reconstructed images that are visually sharper, closer to the real image, and contain fewer artifacts.
Agustsson E and Timofte R. 2017. NTIRE 2017 challenge on single image super-resolution: dataset and study // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1122-1131 [DOI: 10.1109/CVPRW.2017.150]
Al-Hayani M H A and Janabi S. 2023. Medical magnetic resonance imagery super-resolution via residual dense network [EB/OL]. [2024-05-28]. https://www.researchgate.net/publication/369066884
Bevilacqua M, Roumy A, Guillemot C and Morel M L A. 2012. Low-complexity single-image super-resolution based on nonnegative neighbor embedding // Proceedings of 2012 British Machine Vision Conference. Surrey, UK: BMVA Press: 135.1-135.10 [DOI: 10.5244/C.26.135]
Cai Q, Qian Y, Li J, Lv J, Yang Y H, Wu F and Zhang D. 2023. HIPA: hierarchical patch Transformer for single image super resolution. IEEE Transactions on Image Processing, 32: 3226-3237 [DOI: 10.1109/TIP.2023.3279977]
Cao H, Wang Y Y, Chen J, Jiang D S, Zhang X P, Tian Q and Wang M N. 2021. Swin-Unet: Unet-like pure Transformer for medical image segmentation [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/2105.05537.pdf
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with Transformers // Proceedings of the 16th European Conference on Computer Vision (ECCV 2020). Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Chan K C K, Zhou S C, Xu X Y and Loy C C. 2022. BasicVSR++: improving video super-resolution with enhanced propagation and alignment // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 5962-5971 [DOI: 10.1109/CVPR52688.2022.00588]
Chu X X, Tian Z, Wang Y Q, Zhang B, Ren H B, Wei X L, Xia H X and Shen C H. 2021. Twins: revisiting the design of spatial attention in vision Transformers [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/2104.13840.pdf
Dai T, Cai J R, Zhang Y B, Xia S T and Zhang L. 2019. Second-order attention network for single image super-resolution // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 11057-11066 [DOI: 10.1109/CVPR.2019.01132]
Dong C, Loy C C, He K M and Tang X O. 2014. Learning a deep convolutional network for image super-resolution // Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dong X Y, Sun X, Jia X P, Xi Z H, Gao L R and Zhang B. 2021. Remote sensing image super-resolution using novel dense-sampling networks. IEEE Transactions on Geoscience and Remote Sensing, 59(2): 1618-1663 [DOI: 10.1109/TGRS.2020.2994253]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2024-05-28]. https://arxiv.org/pdf/2010.11929.pdf
Gao X B, Lu W, Tao D C and Li X L. 2009. Image quality assessment based on multiscale geometric analysis. IEEE Transactions on Image Processing, 18(7): 1409-1423 [DOI: 10.1109/TIP.2009.2018014]
Gu J J and Dong C. 2021. Interpreting super-resolution networks with local attribution maps // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 9195-9204 [DOI: 10.1109/CVPR46437.2021.00908]
Huang J B, Singh A and Ahuja N. 2015. Single image super-resolution from transformed self-exemplars // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 5197-5206 [DOI: 10.1109/CVPR.2015.7299156]
Huang M X, Liu Y L, Peng Z H, Liu C Y, Lin D H, Zhu S G, Yuan N, Ding K and Jin L W. 2022. SwinTextSpotter: scene text spotting via better synergy between text detection and text recognition // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 4583-4593 [DOI: 10.1109/CVPR52688.2022.00455]
Isobe T, Jia X, Tao X, Li C L, Li R H, Shi Y J, Mu J, Lu H C and Tai Y W. 2022. Look back and forth: video super-resolution with explicit temporal difference modeling // Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 17390-17399 [DOI: 10.1109/CVPR52688.2022.01689]
Jo Y, Oh S W, Kang J and Kim S J. 2018. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation // Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 3224-3232 [DOI: 10.1109/CVPR.2018.00340]
Kim J, Lee J K and Lee K M. 2016. Accurate image super-resolution using very deep convolutional networks // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 1646-1654 [DOI: 10.1109/CVPR.2016.182]
Li A, Zhang L, Liu Y and Zhu C. 2023. Feature modulation Transformer: cross-refinement of global representation via high-frequency prior for image super-resolution // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 12480-12490 [DOI: 10.1109/ICCV51070.2023.01150]
Liang J Y, Cao J Z, Sun G L, Zhang K, Van Gool L and Timofte R. 2021. SwinIR: image restoration using Swin Transformer // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal, Canada: IEEE: 1833-1844 [DOI: 10.1109/ICCVW54120.2021.00210]
Lim B, Son S, Kim H, Nah S and Lee K M. 2017. Enhanced deep residual networks for single image super-resolution // Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, USA: IEEE: 1132-1140 [DOI: 10.1109/CVPRW.2017.151]
Liu L, Ouyang W L, Wang X Y, Fieguth P, Chen J, Liu X W and Pietikäinen M. 2019. Deep learning for generic object detection: a survey [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/1809.02165.pdf
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y Z, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Loshchilov I and Hutter F. 2019. Decoupled weight decay regularization [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/1711.05101.pdf
Martin D, Fowlkes C, Tal D and Malik J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics // Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, Canada: IEEE: 416-423 [DOI: 10.1109/ICCV.2001.937655]
Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T and Aizawa K. 2017. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 76(20): 21811-21838 [DOI: 10.1007/s11042-016-4020-z]
Mei Y Q, Fan Y C and Zhou Y Q. 2021. Image super-resolution with non-local sparse attention // Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 3516-3525 [DOI: 10.1109/CVPR46437.2021.00352]
Mei Y Q, Fan Y C, Zhou Y Q, Huang L C, Huang T S and Shi H H. 2020. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining // Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE: 5689-5698 [DOI: 10.1109/CVPR42600.2020.00573]
Niu B, Wen W L, Ren W Q, Zhang X D, Yang L P, Wang S Z, Zhang K H, Cao X C and Shen H F. 2020. Single image super-resolution via a holistic attention network // Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 191-207 [DOI: 10.1007/978-3-030-58610-2_12]
Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021. Pyramid vision Transformer: a versatile backbone for dense prediction without convolutions // Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 548-558 [DOI: 10.1109/ICCV48922.2021.00061]
Wang X T, Chan K C K, Yu K, Dong C and Loy C C. 2019. EDVR: video restoration with enhanced deformable convolutional networks // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach, USA: IEEE: 1954-1963 [DOI: 10.1109/CVPRW.2019.00247]
Wang Z, Bovik A C, Sheikh H R and Simoncelli E P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600-612 [DOI: 10.1109/TIP.2003.819861]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Wu B C, Xu C F, Dai X L, Wan A, Zhang P Z, Yan Z C, Tomizuka M, Gonzalez J, Keutzer K and Vajda P. 2020. Visual Transformers: token-based image representation and processing for computer vision [EB/OL]. [2024-05-28]. https://arxiv.org/pdf/2006.03677v4.pdf
Xiong W, Xiong C Y, Gao Z R, Chen W Q, Zheng R H and Tian J W. 2023. Image super-resolution with channel-attention-embedded Transformer. Journal of Image and Graphics, 28(12): 3744-3757 [DOI: 10.11834/jig.221033]
Zeyde R, Elad M and Protter M. 2012. On single image scale-up using sparse-representations // Proceedings of the 7th International Conference on Curves and Surfaces. Avignon, France: Springer: 711-730 [DOI: 10.1007/978-3-642-27413-8_47]
Zhang H, Zu K K, Lu J, Zou Y R and Meng D Y. 2021. EPSANet: an efficient pyramid squeeze attention block on convolutional neural network [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/2105.14447.pdf
Zhang N, Wang Y C, Zhang X, Xu D D, Wang X D, Ben G, Zhao Z K and Li Z. 2022a. A multi-degradation aided method for unsupervised remote sensing image super resolution with convolution neural networks. IEEE Transactions on Geoscience and Remote Sensing, 60: #5600814 [DOI: 10.1109/TGRS.2020.3042460]
Zhang X D, Zeng H, Guo S and Zhang L. 2022b. Efficient long-range attention network for image super-resolution [EB/OL]. [2024-05-28]. http://arxiv.org/pdf/2203.06697.pdf
Zhang Y L, Li K P, Li K, Wang L C, Zhong B N and Fu Y. 2018. Image super-resolution using very deep residual channel attention networks // Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 294-310 [DOI: 10.1007/978-3-030-01234-2_18]
Zhou S C, Zhang J W, Zuo W M and Loy C C. 2020. Cross-scale internal graph neural network for image super-resolution // Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 3499-3509
Zhou Y P, Li Z, Guo C L, Bai S, Cheng M M and Hou Q B. 2023. SRFormer: permuted self-attention for single image super-resolution // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 12734-12745 [DOI: 10.1109/ICCV51070.2023.01174]