Text image super-resolution via structure perception enhancement and cross-modal fusion
2024, pp. 1-13
Online publication date: 2024-10-01
DOI: 10.11834/jig.240559
ZHU Zhongjie, ZHANG Lei, LI Pei, et al. Text image super-resolution via structure perception enhancement and cross-modal fusion[J]. Journal of Image and Graphics, 2024: 1-13.
Objective
Scene text image super-resolution is an emerging visual enhancement technique that increases the resolution of low-resolution text images and thereby improves text readability. However, existing methods fail to extract the dynamic features of text structure effectively, so the resulting semantic priors cannot be properly aligned and fused with the image features, which degrades reconstruction quality and hinders text recognition. To address this, a cross-modal fusion super-resolution method based on dynamic perception of text structure is proposed to improve both text image quality and text readability.
Method
First, a text structure dynamic perception module is constructed: through a direction-aware layer and a context linkage unit, it extracts multi-scale directional features of the text and parses the contextual relations between neighboring characters, accurately capturing the dynamic structural features of the text image. Second, a semantic space alignment module is designed, which uses text mask information to promote the generation of refined text semantic priors and aligns these priors with the image features via an affine transformation. On this basis, a cross-modal fusion module combines the text semantic priors with the image features, promoting cross-modal interaction through adaptive weight assignment and outputting the high-resolution text image.
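The abstract does not give implementation details for the direction-aware layer; the following minimal PyTorch sketch shows one plausible reading, assuming oriented 1xk/kx1 kernels at two scales. All module names and kernel sizes here are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a direction-aware layer: oriented kernels at two
# scales respond to horizontal and vertical strokes, a square kernel keeps
# isotropic context, and a 1x1 convolution fuses the branches.
import torch
import torch.nn as nn

class DirectionAwareLayer(nn.Module):
    """Multi-scale directional feature extraction (illustrative only)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),  # horizontal, scale 3
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),  # vertical, scale 3
            nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),  # horizontal, scale 5
            nn.Conv2d(channels, channels, (5, 1), padding=(2, 0)),  # vertical, scale 5
            nn.Conv2d(channels, channels, 3, padding=1),            # isotropic context
        ])
        self.fuse = nn.Conv2d(5 * channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        return self.act(self.fuse(torch.cat(feats, dim=1)))

# Example: a batch of 2 feature maps of a 16x64 text line with 64 channels.
x = torch.randn(2, 64, 16, 64)
print(DirectionAwareLayer(64)(x).shape)  # torch.Size([2, 64, 16, 64])
```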
Result
Compared against a range of mainstream methods on the real-world TextZoom dataset, the proposed method achieves an average recognition accuracy of 62.4% across the three text recognizers ASTER, CRNN, and MORAN, a 2.6% improvement over the second-best method. Its PSNR and SSIM reach 21.9 dB and 0.789, ranking first and second respectively and ahead of most competing methods.
Conclusion
By accurately capturing the dynamic features of text structure to guide the generation of advanced text semantic priors, the proposed method promotes the alignment and fusion of the text and image modalities, effectively improving both reconstruction quality and text readability.
Objective
Scene text image super-resolution (STISR) is an emerging visual enhancement technology specifically designed to improve the quality of low-resolution text images and the readability of the text. It is applied extensively across domains such as autonomous driving, document retrieval, and text recognition. Existing STISR methods can be broadly divided into traditional methods and deep learning methods. Traditional methods can enhance images to some extent but rely heavily on handcrafted feature extractors and complex prior knowledge. In contrast, deep learning methods show significant advantages owing to their robust automatic feature extraction and learning abilities, and have therefore become increasingly popular within the academic community. Recent deep learning based STISR methods have begun to exploit rich semantic priors from text images to guide both image reconstruction and text recovery. Unlike methods that rely solely on visual feature extraction, such as convolutional neural networks, multilayer perceptrons, and Transformers, these emerging methods integrate semantic prior learning with visual feature analysis to achieve more effective image reconstruction. However, they often neglect the dynamic features of text structure, such as the contextual connections between neighboring characters and directional features. As a result, the generated text semantic priors cannot be effectively aligned and fused with the image features, which limits the quality of the reconstructed images. To overcome these limitations, a new cross-modal fusion super-resolution method based on dynamic perception of text structure is proposed to enhance both the quality of low-resolution text images and the readability of the text.
Method
In this study, we propose an innovative cross-modal fusion STISR method based on dynamic perception of text structure. Initially, the input low-resolution text images are rectified using a spatial transformer network and thin-plate splines to adjust the spatial positions of irregular characters, facilitating further processing. These images are then processed by a text structure dynamic perception module and a semantic space alignment module to extract image-modality and text-modality features. The text structure dynamic perception module, equipped with a direction-sensing block and a context linkage unit, captures multi-scale directional features and deciphers the contextual relationships between character neighborhoods, respectively, accurately capturing the dynamic features of the text structure within the image modality. The semantic space alignment module processes the low-resolution images through recognition, rendering, and binarization to derive text semantic priors and mask priors. These priors are then combined via feature addition to generate advanced text semantic priors, which are aligned with the image features through affine transformations under the guidance of the image modality. Lastly, the developed cross-modal fusion module employs an adaptive weight distribution strategy to strengthen the interactive integration of text and image features across modalities, producing the final super-resolved text image.
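As a rough illustration of the alignment and fusion steps just described, the sketch below pairs an SFT-style affine modulation (image features predicting a per-pixel scale and shift for the text prior) with a sigmoid-gated adaptive mixing of the two modalities. Every layer shape and name here is an assumption made for illustration, not the published architecture.

```python
# Hypothetical align-and-fuse step: affine alignment of the text semantic
# prior conditioned on image features, then adaptive-weight cross-modal mixing.
import torch
import torch.nn as nn

class AlignAndFuse(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Affine alignment: gamma/beta predicted from image features.
        self.to_gamma = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_beta = nn.Conv2d(channels, channels, 3, padding=1)
        # Adaptive weight: a per-pixel gate from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, img: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        aligned = self.to_gamma(img) * prior + self.to_beta(img)  # affine transform
        w = self.gate(torch.cat([img, aligned], dim=1))           # adaptive weight
        return w * img + (1.0 - w) * aligned                      # cross-modal mix

img = torch.randn(2, 64, 16, 64)    # image-modality features
prior = torch.randn(2, 64, 16, 64)  # text semantic prior, same spatial size
print(AlignAndFuse(64)(img, prior).shape)  # torch.Size([2, 64, 16, 64])
```

The gate lets the network lean on the image features where the prior is unreliable (e.g., misrecognized characters) and on the prior where the image is heavily degraded, which is one common way to realize adaptive weight distribution.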
Result
The proposed method was compared against 13 mainstream methods. The evaluation focused primarily on the text recognition accuracy of the reconstructed text images, a critical measure given the unique nature of text images, where readability and recognizability are paramount. Secondary metrics were PSNR and SSIM, which supplemented the evaluation despite their limited ability to reflect image quality under the misalignment between low-resolution and high-resolution images in real datasets. Experiments on the real-world TextZoom dataset show that the proposed method achieves accuracies of 67.1%, 56.6%, and 63.6% with the three standard text recognizers ASTER, CRNN, and MORAN, respectively, surpassing the representative method PERMR by 2.9%, 4.6%, and 3.0%. The PSNR and SSIM values of the proposed method are 21.9 dB and 0.789, ranking first and second among all compared methods. In addition, visual comparison with most of these methods further highlights the superior quality and readability of the text images reconstructed by the proposed approach.
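As a consistency check, the 62.4% average accuracy quoted in the condensed abstract follows directly from these three per-recognizer results:

$$\frac{67.1\% + 56.6\% + 63.6\%}{3} = \frac{187.3\%}{3} \approx 62.4\%$$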
Conclusion
In this study, we propose a novel STISR method that guides the generation of advanced text semantic priors by accurately capturing the dynamic features of text structure. The method promotes the alignment and integration of text semantic priors and image features, and the experimental results demonstrate that it holds substantial advantages over other mainstream methods, effectively elevating both the quality of the reconstructed text images and the readability of the text.
Keywords: scene text image super-resolution; dynamic features of text structure; multi-scale orientation feature; semantic space alignment; cross-modal fusion
References
Chang H, Yeung D Y and Xiong Y. 2004. Super-resolution through neighbor embedding//Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington, DC, USA: IEEE: I-I [DOI: 10.1109/CVPR.2004.1315043]
Chen J, Li B and Xue X. 2021. Scene text telescope: text-focused scene image super-resolution//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA: IEEE: 12026-12035 [DOI: 10.1109/CVPR46437.2021.01185]
Chen J, Yu H, Ma J, Li B and Xue X. 2022. Text gestalt: stroke-aware scene text image super-resolution//Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, BC, Canada: AAAI: 285-293 [DOI: 10.1609/aaai.v36i1.19904]
Dong C, Loy C C, He K and Tang X. 2014. Learning a deep convolutional network for image super-resolution//Computer Vision - ECCV 2014: 13th European Conference. Zurich, Switzerland: Springer: 184-199 [DOI: 10.1007/978-3-319-10593-2_13]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. [2024-08-12]. https://arxiv.org/abs/2010.11929
Fang S, Xie H, Wang Y, Mao Z and Zhang Y. 2021. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA: IEEE: 7098-7107 [DOI: 10.1109/CVPR46437.2021.00702]
Farsiu S, Robinson D, Elad M and Milanfar P. 2004. Advances and challenges in super-resolution. International Journal of Imaging Systems and Technology, 14(2): 47-57 [DOI: 10.1002/ima.20007]
Glasner D, Bagon S and Irani M. 2009. Super-resolution from a single image//Proceedings of the IEEE 12th International Conference on Computer Vision. Kyoto, Japan: IEEE: 349-356 [DOI: 10.1109/ICCV.2009.5459271]
Graves A, Fernández S, Gomez F and Schmidhuber J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks//Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA, USA: ACM: 369-376 [DOI: 10.1145/1143844.1143891]
Guo H, Dai T, Meng G and Xia S. 2023. Towards robust scene text image super-resolution via explicit location enhancement//Proceedings of the International Joint Conference on Artificial Intelligence. Macao, China: 782-790 [DOI: 10.24963/ijcai.2023/87]
Hao Y, Madani S, Guan J, Alloulah M, Gupta S and Hassanieh H. 2024. Bootstrapping autonomous driving radars with self-supervised learning//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15012-15023 [https://arxiv.org/abs/2312.04519]
Li X and Orchard M T. 2001. New edge-directed interpolation. IEEE Transactions on Image Processing, 10(10): 1521-1527 [DOI: 10.1109/83.951537]
Luo C, Jin L and Sun Z. 2019. MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognition, 90: 109-118 [DOI: 10.1016/j.patcog.2019.01.020]
Ma J, Guo S and Zhang L. 2023. Text prior guided scene text image super-resolution. IEEE Transactions on Image Processing, 32: 1341-1353 [DOI: 10.1109/TIP.2023.3237002]
Ma J, Liang Z and Zhang L. 2022. A text attention network for spatial deformation robust scene text image super-resolution//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA: IEEE: 5911-5920 [DOI: 10.1109/CVPR52688.2022.00582]
Pourkeshavarz M, Zhang J and Rasouli A. 2024. CaDeT: a causal disentanglement approach for robust trajectory prediction in autonomous driving//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 14874-14884
Qiao L, Li Z S, Cheng Z Z and Li X. 2023. SCID: a Chinese characters invoice-scanned dataset relevant to key information extraction from visually-rich document images. Journal of Image and Graphics, 28(08): 2298-2313 [DOI: 10.11834/jig.220911]
Qin H, Li Y J, Liang Q K and Wang Y N. 2023. AsymcNet: a document images-relevant asymmetric geometry correction network. Journal of Image and Graphics, 28(08): 2314-2329 [DOI: 10.11834/jig.220426]
Shi B, Yang M, Wang X, Lyu P, Yao C and Bai X. 2018. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9): 2035-2048 [DOI: 10.1109/TPAMI.2018.2848939]
Shi B, Bai X and Yao C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304 [DOI: 10.1109/TPAMI.2016.2646371]
Shi Q, Zhu Y, Liu Y, Ye J and Yang D. 2023. Perceiving multiple representations for scene text image super-resolution guided by text recognizer. Engineering Applications of Artificial Intelligence, 124: 106551 [DOI: 10.1016/j.engappai.2023.106551]
Shu R, Zhao C, Feng S, Zhu L and Miao D. 2023. Text-enhanced scene image super-resolution via stroke mask and orthogonal attention. IEEE Transactions on Circuits and Systems for Video Technology, 33(11): 6317-6330 [DOI: 10.1109/TCSVT.2023.3267133]
Tran H T M and Ho-Phuoc T. 2019. Deep Laplacian pyramid network for text images super-resolution//2019 IEEE-RIVF International Conference on Computing and Communication Technologies. Danang, Vietnam: IEEE: 1-6 [DOI: 10.1109/RIVF.2019.8713657]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Advances in Neural Information Processing Systems. Long Beach, USA: 6000-6010 [DOI: 10.5555/3295222.3295349]
Wang W, Xie E, Liu X, Wang W, Liang D, Shen C and Bai X. 2020. Scene text image super-resolution in the wild//Computer Vision - ECCV 2020: 16th European Conference. Glasgow, UK: Springer: 650-666 [DOI: 10.1007/978-3-030-58607-2_38]
Yang J, Wright J, Huang T S and Ma Y. 2010. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11): 2861-2873 [DOI: 10.1109/TIP.2010.2050625]
Zhang W, Deng X, Jia B, Yu X, Chen Y, Ma J, Ding Q and Zhang X. 2023. Pixel adapter: a graph-based post-processing approach for scene text image super-resolution//Proceedings of the 31st ACM International Conference on Multimedia. Ottawa, Canada: ACM: 2168-2179 [DOI: 10.1145/3581783.3611913]
Zhu S, Zhao Z, Fang P and Xue H. 2023. Improving scene text image super-resolution via dual prior modulation network//Proceedings of the AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI: 3843-3851 [DOI: 10.1609/aaai.v37i3.25497]
Zhu Z, Zhang L, Bai Y, Wang Y and Li P. 2024. Scene text image super-resolution through multi-scale interaction of structural and semantic priors. IEEE Transactions on Artificial Intelligence [DOI: 10.1109/TAI.2024.3375836]