Deep learning methods for scene text detection and recognition

Chongyu Liu; Xiaoxue Chen; Canjie Luo; Lianwen Jin; Yang Xue; Yuliang Liu

doi:10.11834/jig.210044

Image Processing and Communication Technology

Journal of Image and Graphics, 2021, Vol. 26, No. 6, pp. 1330-1367
Print publication date: 2021-06-16
Accepted: 2021-03-06

Chongyu Liu, Xiaoxue Chen, Canjie Luo, Lianwen Jin, Yang Xue, Yuliang Liu. Deep learning methods for scene text detection and recognition[J]. Journal of Image and Graphics, 2021, 26(6): 1330-1367. DOI: 10.11834/jig.210044.

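Abstract (translated from the Chinese)

Many natural scene images contain rich text, which plays an important role in scene understanding. With the rapid development of mobile internet technology, many new applications need to exploit this textual information, for example signboard recognition and autonomous driving. The analysis and processing of scene text has therefore become one of the research hotspots in computer vision; the task mainly comprises text detection and text recognition. Traditional text detection and recognition methods rely on hand-crafted features and rules, and their models are complex to design, inefficient, and generalize poorly. With the development of deep learning, scene text detection, scene text recognition, and end-to-end scene text detection and recognition have all achieved breakthrough progress, with marked improvements in both performance and efficiency. This paper introduces the research background of the field and organizes, classifies, and summarizes deep-learning-based methods for scene text detection, recognition, and end-to-end detection and recognition, explaining the basic ideas, strengths, and weaknesses of each class of methods. For the methods within each category, it further discusses and analyzes the algorithmic pipelines, applicable scenarios, and lines of technical development of the main models. In addition, it lists and describes some mainstream public datasets and compares the performance of the models on representative datasets. Finally, it summarizes the limitations of current scene text detection, recognition, and end-to-end detection and recognition algorithms on data from different scenarios, together with future challenges and development trends.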

Abstract

With the rapid development of internet and mobile internet technologies, many new applications require extensive use of the rich text information in natural scenes, such as signboard recognition and autonomous driving. Thus, the analysis and processing of scene text plays an essential role in these applications and has increasingly become one of the research hotspots in computer vision. Traditional text detection and recognition methods often rely on manually designed features, are computationally expensive and inefficient, and lack satisfactory generalization performance for complex scenes. With the development of deep learning in recent years, convolutional neural networks have brought great progress in scene text detection and recognition. These deep learning-based methods outperform traditional ones by a large margin and have already become the mainstream in the field of text reading in the wild. For scene text detection, the methods can be divided into two categories according to the target objects they predict: top-down methods and bottom-up methods. Top-down methods mainly inherit the basic idea of general object detection or instance segmentation and directly regress the entire bounding box of a text instance. In contrast, bottom-up methods, following the idea of traditional approaches, first detect some components of a text instance and then group them together through some rules. Bottom-up methods are more effective than top-down methods at detecting text of arbitrary shapes and orientations, and they are less sensitive to text scale. However, grouping the detected components into different text instances requires complex design and processing; thus, the inference stage of bottom-up approaches becomes inefficient. These methods also encounter some difficulties when detecting long text. In addition, text conglutination (adjacent text instances being merged into one) occurs when detecting dense text. The top-down methods, however, do not have this issue and can achieve higher precision for text detection. In recent years, recognizing text in natural scenes, also known as scene text recognition (STR), has aroused great interest in academia and industry. The objective of STR is to translate a cropped text instance image into a target string sequence. Although optical character recognition (OCR) in scanned documents has been well developed, STR remains challenging due to many factors, such as very complex backgrounds, various fonts, and imperfect imaging conditions. Early work relied on hand-crafted features, such as histogram of oriented gradients descriptors, connected components, and the stroke width transform. However, the performance of these approaches is limited by the low representational capability of such features. In recent years, with the rise and development of deep learning, the community has witnessed substantial advancements. Scene text recognition approaches based on deep learning can be roughly divided into two branches, namely, segmentation-based approaches and segmentation-free approaches. Segmentation-based approaches attempt to locate the position of each character in the input text instance image, apply a character classifier to recognize each character, and then group the characters into text lines to obtain the final recognition result. Segmentation-free approaches recognize the text instance image as a whole and focus on mapping the entire text instance image directly into a target string sequence. Both branches have their advantages and limitations. Therefore, practitioners should select the best trade-off according to their needs in different application scenarios. Although the practicality and efficiency of recognition approaches have improved significantly over the past decades, further research is still required on the generalization ability, evaluation protocols, and application scenarios of STR. Finally, end-to-end scene text spotting aims to combine text detection and text recognition into a unified system, which can be optimized in a single pipeline. Bridging the gap between the detection branch and the recognition branch is the most essential problem in the design of an end-to-end text spotting system. Similar to general object detection and instance segmentation, end-to-end text spotting methods can be divided into two categories, namely, two-stage methods and one-stage methods. Two-stage methods are mainly based on Faster R-CNN (region-based convolutional neural network) and Mask R-CNN, in which region of interest (RoI) pooling/align acts as a bridge between the two branches. However, these operations may lose some information given that the region proposals from the region proposal network (RPN) are insufficiently accurate. One-stage methods follow the pipeline of detection then recognition. Various feature-alignment operations are carefully designed to strengthen the link between the detection and recognition branches. We sort out and summarize the detection and recognition methods for scene text, and further elaborate and analyze the basic ideas of the various methods and their pros and cons. We aim to provide a reference for researchers and to support future work.
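As an illustration of the bottom-up idea described in the abstract (detect components first, then group them into text instances by rules), the following is a minimal sketch and not any specific method from the survey. It assumes the components are already available as axis-aligned boxes, and the grouping rule, the helper names, and the thresholds `max_gap_ratio` and `min_overlap` are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# Assumed component format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def vertical_overlap_ratio(a: Box, b: Box) -> float:
    """Overlap of the vertical extents, relative to the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(0.0, overlap) / shorter if shorter > 0 else 0.0


def horizontal_gap(a: Box, b: Box) -> float:
    """Horizontal distance between two boxes (0 if they overlap)."""
    return max(0.0, max(a[0], b[0]) - min(a[2], b[2]))


def group_components(
    boxes: List[Box],
    max_gap_ratio: float = 0.5,  # hypothetical threshold
    min_overlap: float = 0.6,    # hypothetical threshold
) -> List[List[int]]:
    """Group component indices into text instances with union-find."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Illustrative rule: link two components when they sit on roughly
    # the same line and the gap between them is small relative to
    # their height.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            height = min(a[3] - a[1], b[3] - b[1])
            if (vertical_overlap_ratio(a, b) >= min_overlap
                    and horizontal_gap(a, b) <= max_gap_ratio * height):
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    components = [
        (0, 0, 10, 10), (12, 0, 22, 10), (24, 1, 34, 11),  # first line
        (0, 40, 10, 50), (13, 40, 23, 50),                 # second line
    ]
    print(group_components(components))  # [[0, 1, 2], [3, 4]]
```

The pairwise comparison grows quadratically with the number of components, which hints at why the abstract describes the grouping stage as a source of inefficiency. A rule this simple would also merge closely spaced neighbouring instances, the kind of conglutination the abstract mentions for dense text.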

Keywords

scene text detection; scene text recognition (STR); end-to-end scene text spotting; deep learning; optical character recognition (OCR); survey

references

Almazán J, Gordo A, Fornés A and Valveny E. 2014. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12): 2552-2566[DOI:10.1109/TPAMI.2014.2339814]

Baek J, Kim G, Lee J, Park S, Han D, Yun S, Oh S J and Lee H. 2019a. What is wrong with scene text recognition model comparisons? Dataset and model analysis//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4714-4722[DOI: 10.1109/ICCV.2019.00481http://dx.doi.org/10.1109/ICCV.2019.00481]

Baek Y, Lee B, Han D, Yun S and Lee H. 2019b. Character region awareness for text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9365-9374[DOI: 10.1109/CVPR.2019.00959http://dx.doi.org/10.1109/CVPR.2019.00959]

Baek Y, Shin S, Baek J, Park S, Lee J, Nam D and Lee H. 2020. Character region attention for text spotting//Proceeding of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 504-521[DOI: 10.1007/978-3-030-58526-6_30http://dx.doi.org/10.1007/978-3-030-58526-6_30]

Bahdanau D, Cho K and Bengio Y. 2015. Neural machine translation by jointly learning to align and translate//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s. n.]

Bissacco A, Cummins M, Netzer Y and Neven H. 2013. PhotoOCR: Reading text in uncontrolled conditions//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 785-792[DOI: 10.1109/ICCV.2013.102http://dx.doi.org/10.1109/ICCV.2013.102]

Bookstein F L. 1989. Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6): 567-585[DOI:10.1109/34.24792]

Breiman L. 2001. Random forests. Machine Learning, 45(1): 5-32[DOI:10.1023/A:1010933404324]

Busta M, Neumann L and Matas J. 2017. Deep textspotter: an end-to-end trainable scene text localization and recognition framework//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy, 2223-2231. [DOI: 10.1109/ICCV.2017.242http://dx.doi.org/10.1109/ICCV.2017.242].

Canny J. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6): 679-698[DOI:10.1109/TPAMI.1986.4767851]

Casey R G and Lecolinet E. 1996. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7): 690-706[DOI:10.1109/34.506792]

Chen X X, Wang T W, Zhu Y Z, Jin L W and Luo C J. 2020. Adaptive embedding gate for attention-based scene text recognition. Neurocomputing, 381: 261-271[DOI:10.1016/j.neucom.2019.11.049]

Cheng Z Z, Bai F, Xu Y L, Zheng G, Pu S L and Zhou S G. 2017. Focusing attention: towards accurate text recognition in natural images//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5086-5094[DOI: 10.1109/ICCV.2017.543http://dx.doi.org/10.1109/ICCV.2017.543]

Cheng Z Z, Xu Y L, Bai F, Niu Y, Pu S L and Zhou S G. 2018. AON: towards arbitrarily-oriented text recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5571-5579[DOI: 10.1109/CVPR.2018.00584http://dx.doi.org/10.1109/CVPR.2018.00584]

Ch'ng C K and Chan C S. 2017. Total-text: a comprehensive dataset for scene text detection and recognition//Proceedings of the 14th International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 935-942[DOI: 10.1109/ICDAR.2017.157http://dx.doi.org/10.1109/ICDAR.2017.157]

Chng C K, Liu Y L, Sun Y P, Ng C C, Luo C J, Ni Z H, Fang C M, Zhang S T, Han J Y, Ding E R, Liu J T, Karatzas D, Chan C S and Jin L W. 2019. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT//Proceedings of 2019 International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 1571-1576[DOI: 10.1109/ICDAR.2019.00252http://dx.doi.org/10.1109/ICDAR.2019.00252]

Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734[DOI: 10.3115/v1/D14-1179http://dx.doi.org/10.3115/v1/D14-1179]

Cong F Z, Hu W P, Huo Q and Guo L. 2019. A comparative study of attention-based encoder-decoder approaches to natural scene text recognition//Proceedings of 2019 International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 916-921[DOI: 10.1109/ICDAR.2019.00151http://dx.doi.org/10.1109/ICDAR.2019.00151]

Dai J F, Li Y, He K M and Sun J. 2016. R-FCN: object detection via region-based fully convolutional networks//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: ACM: 379-387

Dai Y C, Huang Z, Gao Y T, Xu Y X, Chen K, Guo J and Qiu W D. 2018. Fused text segmentation networks for multi-oriented scene text detection//Proceedings of the 24th International Conference on Pattern Recognition. Beijing, China: IEEE: 3604-3609[DOI: 10.1109/ICPR.2018.8546066http://dx.doi.org/10.1109/ICPR.2018.8546066]

Deng D, Liu H F, Li X L and Cai D. 2018. Pixellink: detecting scene text via instance segmentation//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18). New Orleans, USA: AAAI: 6773-6780

Dollár P, Appel R, Belongie S and Perona P. 2014. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8): 1532-1545[DOI:10.1109/TPAMI.2014.2300479]

Dong C, Loy C C, He K M and Tang X O. 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2): 295-307[DOI:10.1109/TPAMI.2015.2439281]

Epshtein B, Ofek E and Wexler Y. 2010. Detecting text in natural scenes with stroke width transform//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2963-2970[DOI: 10.1109/cvpr.2010.5540041http://dx.doi.org/10.1109/cvpr.2010.5540041]

Fang S C, Xie H T, Chen J J, Tan J L and Zhang Y D. 2019. Learning to draw text in natural images with conditional adversarial networks//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: IJCAI: 715-722[DOI: 10.24963/ijcai.2019/101http://dx.doi.org/10.24963/ijcai.2019/101]

Fang S C, Xie H T, Zha Z J, Sun N N, Tan J L and Zhang Y D. 2018. Attention and language ensemble for scene text recognition with convolutional sequence modeling//Proceedings of the 26th ACM International Conference on Multimedia. Seoul, Korea(South): ACM: 248-256[DOI: 10.1145/3240508.3240571http://dx.doi.org/10.1145/3240508.3240571]

Feng W, He W H, Yin F, Zhang X Y and Liu C L. 2019a. TextDragon: an end-to-end framework for arbitrary shaped text spotting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9076-9085[DOI: 10.1109/ICCV.2019.00917http://dx.doi.org/10.1109/ICCV.2019.00917]

Feng X J, Yao H X and Zhang S P. 2019b. Focal CTC loss for Chinese optical character recognition on unbalanced datasets. Complexity, 2019: #9345861[DOI:10.1155/2019/9345861]

Gao Y Z, Chen Y Y, Wang J Q, Tang M and Lu H Q. 2018. Dense chained attention network for scene text recognition//Proceedings of the 25th IEEE International Conference on Image Processing. Athens, Greece: IEEE: 679-683[DOI: 10.1109/ICIP.2018.8451273http://dx.doi.org/10.1109/ICIP.2018.8451273]

Gao Y Z, Chen Y Y, Wang J Q, Tang M and Lu H Q. 2019. Reading scene text with fully convolutional sequence modeling. Neurocomputing, 339: 161-170[DOI:10.1016/j.neucom.2019.01.094]

Goel V, Mishra A, Alahari K and Jawahar C V. 2013. Whole is greater than sum of parts: recognizing scene text words//Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE: 398-402[DOI: 10.1109/ICDAR.2013.87http://dx.doi.org/10.1109/ICDAR.2013.87]

Gomez R, Shi B G, Gomez L, Numann L, Veit A, Matas J, Belongie S and Karatzas D. 2017. ICDAR2017 robust reading challenge on COCO-text//Proceeding of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 1435-1443[DOI: 10.1109/ICDAR.2017.234http://dx.doi.org/10.1109/ICDAR.2017.234]

Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y. 2014. Generative adversarial nets//Proceedings of the 27th Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 2672-2680

Goodfellow I J, Warde-Farley D, Mirza M, Courville A and Bengio Y. 2013. Maxout networks//Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA: ACM: 1319-1327

Gordo A. 2015. Supervised mid-level features for word image representation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 2956-2964[DOI: 10.1109/CVPR.2015.7298914http://dx.doi.org/10.1109/CVPR.2015.7298914]

Graves A. 2012. Supervised sequence labelling//Graves A, ed. Supervised Sequence Labelling with Recurrent Neural Networks. Berlin, Heidelberg: Springer: 5-13[DOI: 10.1007/978-3-642-24797-2_2http://dx.doi.org/10.1007/978-3-642-24797-2_2]

Graves A and Jaitly N. 2014. Towards end-to-end speech recognition with recurrent neural networks//Proceedings of the 31st International Conference on Machine Learning. Bejing, China: JMLR: 1764-1772

Graves A, Liwicki M, Fernandez S, Bertolami R, Bunke H and Schmidhuber R. 2009. A novelconnectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5): 855-868[DOI:10.1109/TPAMI.2008.137]

Graves A, Fernández S, Gomez F and Schmidhuber J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks//Proceedings of the 23rd international conference on Machine learning. Pittsburgh, USA: ACM: 369-376[DOI: 10.1145/1143844.1143891http://dx.doi.org/10.1145/1143844.1143891]

Graves A, Mohamed A R and Hinton G. 2013. Speech recognition with deep recurrent neural networks//Proceedings of 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada: IEEE: 6645-6649[DOI: 10.1109/ICASSP.2013.6638947http://dx.doi.org/10.1109/ICASSP.2013.6638947]

Graves A and Schmidhuber J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5/6): 602-610[DOI:10.1016/j.neunet.2005.06.042]

Guo Q, Wang F L, Lei J, Tu D and Li G H. 2016. Convolutional feature learning and hybrid CNN-HMM for scene number recognition. Neurocomputing, 184: 78-90[DOI:10.1016/j.neucom.2015.07.135]

Gupta A, Vedaldi A and Zisserman A. 2016. Synthetic data for text localisation in natural images//Proceedings of 2016 IEEE conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2315-2324[DOI: 10.1109/CVPR.2016.254http://dx.doi.org/10.1109/CVPR.2016.254]

He K M, Gkioxari G, Dollár P and Girshick R. 2017a. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2961-2969[DOI: 10.1109/ICCV.2017.322http://dx.doi.org/10.1109/ICCV.2017.322]

He K M, Zhang X Y, Ren S Q and Sun J. 2016a. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90http://dx.doi.org/10.1109/CVPR.2016.90]

He P, Huang W L, He T, Zhu Q L, Qiao Y and Li X L. 2017b. Single shot text detector with regional attention//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3047-3055[DOI: 10.1109/ICCV.2017.331http://dx.doi.org/10.1109/ICCV.2017.331]

He P, Huang W L, Qiao Y, Loy C C and Tang X O. 2016b. Reading scene text in deep convolutional sequences//Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, USA: AAAI: 3501-3508

He T, Huang W L, Qiao Y and Yao J. 2016c. Accurate text localization in natural image with cascaded convolutional text network[EB/OL]. [2021-01-21].https://arxiv.org/pdf/1603.09423.pdfhttps://arxiv.org/pdf/1603.09423.pdf

He W H, Zhang X Y, Yin F and Liu C L. 2017c. Deep direct regression for multi-oriented scene text detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 745-753[DOI: 10.1109/ICCV.2017.87http://dx.doi.org/10.1109/ICCV.2017.87]

He X W, Yang Y, Shi B G and Bai X. 2019. VD-SAN: visual-densely semantic attention network for image caption generation. Neurocomputing, 328: 48-55[DOI:10.1016/j.neucom.2018.02.106]

He T, Tian Z, Huang W, Shen C, Qiao Y and Sun C. 2018. An end-to-end textspotter with explicit alignment and attention//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, 5020-5029[doi:10.1109/CVPR.2018.00527http://dx.doi.org/10.1109/CVPR.2018.00527].

Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780[DOI:10.1162/neco.1997.9.8.1735]

Hu H, Zhang C Q, Luo Y X, Wang Y Z, Han J Y and Ding E R. 2017. Wordsup: exploiting word annotations for character based text detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 4940-4949[DOI: 10.1109/ICCV.2017.529http://dx.doi.org/10.1109/ICCV.2017.529]

Hu W Y, Cai X C, Hou J, Yi S and Lin Z P. 2020. GTC: guided training of ctc towards efficient and accurate scene text recognition//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI: 11005-11012

Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4700-4708[DOI: 10.1109/CVPR.2017.243http://dx.doi.org/10.1109/CVPR.2017.243]

Huang L C, Yang Y, Deng Y F and Yu Y N. 2015. Densebox: unifying landmark localization with end to end object detection[EB/OL]. [2021-01-21].https://arxiv.org/pdf/1509.04874.pdfhttps://arxiv.org/pdf/1509.04874.pdf

Huang W L, Lin Z, Yang J C and Wang J. 2013. Text localization in natural images using stroke feature transform and text covariance descriptors//Proceedings of 2013 International Conference on Computer Vision. Sydney, Australia: IEEE: 1241-1248[DOI: 10.1109/ICCV.2013.157http://dx.doi.org/10.1109/ICCV.2013.157]

Huang Y L, Sun Z H, Jin L W and Luo C J. 2020. EPAN: effective parts attention network for scene text recognition. Neurocomputing, 376: 202-213[DOI:10.1016/j.neucom.2019.10.010]

Jaderberg M, Simonyan K, Vedaldi A and Zisserman A. 2015a. Deep structured output learning for unconstrained text recognition//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s. n.]

Jaderberg M, Simonyan K, Vedaldi A and Zisserman A. 2016. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1): 1-20[DOI:10.1007/s11263-015-0823-z]

Jaderberg M, Simonyan K, Zisserman A and Kavukcuoglu K. 2015b. Spatial transformer networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: ACM: 2017-2025

Jaderberg M, Vedaldi A and Zisserman A. 2014. Deep features for text spotting//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 512-528[DOI: 10.1007/978-3-319-10593-2_34http://dx.doi.org/10.1007/978-3-319-10593-2_34]

Jiang Y Y, Zhu X Y, Wang X B, Yang S L, Li W, Wang H, Fu P and Luo Z B. 2018. R2CNN: rotational region CNN for Arbitrarily-oriented scene text detection//Proceedings of the 24th International Conference on Pattern Recognition. Beijing, China: IEEE: 3610-3615[DOI: 10.1109/ICPR.2018.8545598http://dx.doi.org/10.1109/ICPR.2018.8545598].

Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160[DOI: 10.1109/ICDAR.2015.7333942http://dx.doi.org/10.1109/ICDAR.2015.7333942]

Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, Mas J, Mota D F, Almazàn J A and de las Heras L. 2013. ICDAR 2013 robust reading competition//Proceeding of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE: 1484-1493[DOI: 10.1109/ICDAR.2013.221http://dx.doi.org/10.1109/ICDAR.2013.221]

Kim K I, Jung K and Kim J H. 2003. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12): 1631-1639[DOI:10.1109/TPAMI.2003.1251157]

LeCun Y, Bottou L, Bengio Y and Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324[DOI:10.1109/5.726791]

Lee C Y and Osindero S. 2016. Recursive recurrent nets with attention modeling for OCR in the wild//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2231-2239[DOI: 10.1109/CVPR.2016.245http://dx.doi.org/10.1109/CVPR.2016.245]

Li H, Wang P and Shen C H. 2017a. Towards end-to-end text spotting with convolutional recurrent neural networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5238-5246[DOI: 10.1109/ICCV.2017.560http://dx.doi.org/10.1109/ICCV.2017.560]

Li H, Wang P, Shen C H and Zhang G Y. 2019. Show, attend and read: a simple and strong baseline for irregular text recognition//Proceedings ofthe 33rd Conference on Artificial Intelligence, AAAI 2019, the 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. Honolulu, USA: AAAI: 8610-8617[DOI: 10.1609/aaai.v33i01.33018610http://dx.doi.org/10.1609/aaai.v33i01.33018610]

Li Y, Qi H Z, Dai J F, Ji X Y and Wei Y C. 2017b. Fully convolutional instance-aware semantic segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2359-2367[DOI: 10.1109/CVPR.2017.472http://dx.doi.org/10.1109/CVPR.2017.472]

Liang M and Hu X L. 2015. Recurrent convolutional neural network for object recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3367-3375[DOI: 10.1109/CVPR.2015.7298958http://dx.doi.org/10.1109/CVPR.2015.7298958]

Liao M H, Lyu P, He M H, Yao C, Wu W H and Bai X. 2021. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2): 532-548[DOI:10.1109/TPAMI.2019.2937086]

Liao M H, Pang G, Huang J, Hassner T and Bai X. 2020a. Mask TextSpotter v3: segmentation proposal network for robust scene text spotting//Proceeding of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 706-722[DOI: 10.1007/978-3-030-58621-8_41http://dx.doi.org/10.1007/978-3-030-58621-8_41]

Liao M H, Shi B G and Bai X. 2018a. TextBoxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8): 3676-3690[DOI:10.1109/TIP.2018.2825107]

Liao M H, Shi B G, Bai X, Wang X G and Liu W Y. 2017. TextBoxes: a fast text detector with a single deep neural network//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: ACM: 4161-4167

Liao M H, Wan Z Y, Yao C, Chen K and Bai X. 2020b. Real-time scene text detection with differentiable binarization//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI: 11474-11481

Liao M H, Zhu Z, Shi B G, Xia G S and Bai X. 2018b. Rotation-sensitive regression for oriented scene text detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5909-5918[DOI: 10.1109/CVPR.2018.00619http://dx.doi.org/10.1109/CVPR.2018.00619]

Litman R, Anschel O, Tsiper S, Litman R, Mazor S and Manmatha R. 2020. SCATTER: selective context attentional scene text recognizer//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11959-11969[DOI: 10.1109/CVPR42600.2020.01198http://dx.doi.org/10.1109/CVPR42600.2020.01198]

Liu C L, Koga M and Fujisawa H. 2002. Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(11): 1425-1437[DOI:10.1109/TPAMI.2002.1046151]

Liu H, Jin S and Zhang C S. 2018a. Connectionist temporal classification with maximum entropy regularization//Proceedings of 2018 Annual Conference on Neural Information Processing Systems. Montréal, Canada: [s. n.]: 831-841

Liu J C, Liu X B, Sheng J, Liang D, Li X and Liu Q J. 2019a. Pyramid mask text detector[EB/OL]. [2021-01-21].https://arxiv.org/pdf/1903.11800.pdfhttps://arxiv.org/pdf/1903.11800.pdf

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016a. SSD: single shot multibox detector//Proceeding of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Spring: 21-37[DOI: 10.1007/978-3-319-46448-0_2http://dx.doi.org/10.1007/978-3-319-46448-0_2]

Liu W, Chen C F and Wong K Y K. 2018b. Char-Net: a character-aware neural network for distorted scene text recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18). New Orleans, USA: AAAI: 7154-7161

Liu W, Chen C F, Wong K Y K, Su Z Z and Han J Y. 2016b. STAR-Net: a spatial attention residue network for scene text recognition//Proceedings of 2016 British Machine Vision Conference. York, UK: BMVA[DOI: 10.5244/C.30.43http://dx.doi.org/10.5244/C.30.43]

Liu X B, Liang D, Yan S, Chen D G, Qiao Y and Yan J J. 2018c. FOTS: fast oriented text spotting with a unified network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5676-5685[DOI: 10.1109/CVPR.2018.00595http://dx.doi.org/10.1109/CVPR.2018.00595]

Liu X H, Kawanishi T, Wu X M and Kashino K. 2016c. Scene text recognition with CNN classifier and WFST-based word labeling//Proceedings of the 23rd International Conference on Pattern Recognition. Cancun, Mexico: IEEE: 3999-4004[DOI: 10.1109/ICPR.2016.7900259http://dx.doi.org/10.1109/ICPR.2016.7900259]

Liu Y, Wang Z W, Jin H L and Wassell I. 2018d. Synthetically supervised feature learning for scene text recognition//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 449-465[DOI: 10.1007/978-3-030-01228-1_27http://dx.doi.org/10.1007/978-3-030-01228-1_27]

Liu Y L, Chen H, Shen C H, He T, Jin L W and Wang L W. 2020. ABCNet: real-timescene text spotting with adaptive bezier-curve network//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9809-9818[DOI: 10.1109/CVPR42600.2020.00983http://dx.doi.org/10.1109/CVPR42600.2020.00983]

Liu Y L and Jin L W. 2017. Deep matching prior network: toward tighter multi-oriented text detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1962-1969[DOI: 10.1109/CVPR.2017.368http://dx.doi.org/10.1109/CVPR.2017.368]

Liu Y L, Jin L W, Zhang S T, Luo C J and Zhang S. 2019b. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90: 337-345[DOI:10.1016/j.patcog.2019.02.002]

Liu Y L, Zhang S, Jin L W, Xie L L, Wu Y Q and WangZ P. 2019c. Omnidirectional scene text detection with sequential-free box discretization//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: [s. n.]: 3052-3058[DOI: 10.24963/ijcai.2019/423http://dx.doi.org/10.24963/ijcai.2019/423]

Liu Z C, Li Y X, Ren F B, Goh W L and Yu H. 2018e. SqueezedText: a real-time scene text recognition by binary convolutional encoder-decoder network//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18). New Orleans, USA: AAAI: 7194-7201

Liu Z C, Lin G S, Yang S, Liu F Y, Lin W S and Goh W L. 2019d. Towards robust curve text detection with conditional spatial expansion//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7269-7278[DOI: 10.1109/CVPR.2019.00744]

Liu J M, Zhang C Q, Sun Y P, Han J Y and Ding E R. 2018f. Detecting text in the wild with deep character embedding network//Proceedings of the 14th Asian Conference on Computer Vision. Perth, Australia: Springer: 501-517[DOI: 10.1007/978-3-030-20870-7_31]

Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440[DOI: 10.1109/CVPR.2015.7298965]

Long S B, Ruan J Q, Zhang W J, He X, Wu W H and Yao C. 2018. TextSnake: a flexible representation for detecting text of arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 19-35[DOI: 10.1007/978-3-030-01216-8_2]

Luo C J, Jin L W and Sun Z H. 2019. MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognition, 90: 109-118[DOI:10.1016/j.patcog.2019.01.020]

Luo C J, Lin Q X, Liu Y L, Jin L W and Shen C H. 2021. Separating content from style using adversarial learning for recognizing text in the wild. International Journal of Computer Vision[DOI:10.1007/s11263-020-01411-1]

Luo C J, Zhu Y Z, Jin L W and Wang Y P. 2020. Learn to Augment: joint data augmentation and network optimization for text recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13743-13752[DOI: 10.1109/CVPR42600.2020.01376]

Lyu P, Liao M H, Yao C, Wu W H and Bai X. 2018b. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 71-88[DOI: 10.1007/978-3-030-01264-9_5]

Lyu P, Yao C, Wu W H, Yan S C and Bai X. 2018a. Multi-oriented scene text detection via corner localization and region segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7553-7563[DOI: 10.1109/CVPR.2018.00788]

Ma J Q, Shao W Y, Ye H, Wang L, Wang H, Zheng Y B and Xue X Y. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11): 3111-3122[DOI:10.1109/TMM.2018.2818020]

Matas J, Chum O, Urban M and Pajdla T. 2004. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10): 761-767[DOI:10.1016/j.imavis.2004.02.006]

Minetto R, Thome N, Cord M, Leite N J and Stolfi J. 2013. T-HOG: an effective gradient-based descriptor for single line text regions. Pattern Recognition, 46(3): 1078-1090[DOI:10.1016/j.patcog.2012.10.009]

Miao Y, Gowayyed M and Metze F. 2015. EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding//Proceedings of 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Scottsdale, USA: IEEE: 167-174[DOI: 10.1109/ASRU.2015.7404790]

Mishra A, Alahari K and Jawahar C V. 2012. Scene text recognition using higher order language priors//Proceedings of 2012 British Machine Vision Conference. Surrey, UK: BMVA Press: 1-11[DOI: 10.5244/C.26.127]

Mishra A, Alahari K and Jawahar C V. 2016. Enhancing energy minimization framework for scene text recognition with top-down cues. Computer Vision and Image Understanding, 145: 30-42[DOI:10.1016/j.cviu.2016.01.002]

Mou Y Q, Tan L, Yang H, Chen J Y, Liu L Y, Yan R and Huang Y H. 2020. PlugNet: degradation aware scene text recognition supervised by a pluggable super-resolution unit//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 158-174[DOI: 10.1007/978-3-030-58555-6_10]

Nayef N, Yin F, Bizid I, Choi H, Feng Y, Karatzas D, Luo Z B, Pal U, Rigaud C, Chazalon J, Khlif W, Luqman M M, Burie J C, Liu C L and Ogier J M. 2017. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 1454-1459[DOI: 10.1109/ICDAR.2017.237]

Neumann L and Matas J. 2010. A method for text localization and recognition in real-world images//Proceedings of the 10th Asian Conference on Computer Vision. Queenstown, New Zealand: Springer: 770-783[DOI: 10.1007/978-3-642-19318-7_60]

Neumann L and Matas J. 2012. Real-time scene text localization and recognition//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3538-3545[DOI: 10.1109/CVPR.2012.6248097]

Phan T Q, Shivakumara P, Tian S X and Tan C L. 2013. Recognizing text with perspective distortion in natural scenes//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 569-576[DOI: 10.1109/ICCV.2013.76]

Qi X B, Chen Y H, Xiao R, Li C G, Zou Q and Cui S G. 2019. A novel joint character categorization and localization approach for character-level scene text recognition//Proceedings of 2019 International Conference on Document Analysis and Recognition Workshops. Sydney, Australia: IEEE: 83-90[DOI: 10.1109/ICDARW.2019.40086]

Qiao L, Tang S L, Cheng Z Z, Xu Y L, Niu Y, Pu S L and Wu F. 2020a. Text perceptron: towards end-to-end arbitrary-shaped text spotting//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI: 11899-11907

Qiao Z, Zhou Y, Yang D B, Zhou Y C and Wang W P. 2020b. SEED: semantics enhanced encoder-decoder framework for scene text recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13525-13534[DOI: 10.1109/CVPR42600.2020.01354]

Qin S Y, Bissacco A, Raptis M, Fujii Y and Xiao Y. 2019. Towards unconstrained end-to-end text spotting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4704-4714[DOI: 10.1109/ICCV.2019.00480]

Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]

Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6517-6525[DOI: 10.1109/CVPR.2017.690]

Ren S Q, He K M, Girshick R B and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of 2015 Annual Conference on Neural Information Processing Systems. Montreal, Canada: [s. n.]: 91-99

Risnumawan A, Shivakumara P, Chan C S and Tan C L. 2014. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18): 8027-8048[DOI:10.1016/j.eswa.2014.07.008]

Rodriguez-Serrano J A, Gordo A and Perronnin F. 2015. Label embedding: a frugal baseline for text recognition. International Journal of Computer Vision, 113(3): 193-207[DOI:10.1007/s11263-014-0793-6]

Ronneberger O, Fischer P and Brox T. 2015. U-net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241[DOI: 10.1007/978-3-319-24574-4_28]

Sheng F F, Chen Z N and Xu B. 2019. NRTR: a no-recurrence sequence-to-sequence model for scene text recognition//Proceedings of 2019 International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 781-786[DOI: 10.1109/ICDAR.2019.00130]

Shi B G, Bai X and Belongie S. 2017b. Detecting oriented text in natural images by linking segments//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2550-2558[DOI: 10.1109/CVPR.2017.371]

Shi B G, Bai X and Yao C. 2017a. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304[DOI:10.1109/TPAMI.2016.2646371]

Shi B G, Wang X G, Lyu P, Yao C and Bai X. 2016. Robust scene text recognition with automatic rectification//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4168-4176[DOI: 10.1109/CVPR.2016.452]

Shi B G, Yang M K, Wang X G, Lyu P, Yao C and Bai X. 2019. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9): 2035-2048[DOI:10.1109/TPAMI.2018.2848939]

Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s. n.]

Su B L and Lu S J. 2014. Accurate scene text recognition based on recurrent neural network//Proceedings of the 12th Asian Conference on Computer Vision. Singapore, Singapore: Springer: 35-48[DOI: 10.1007/978-3-319-16865-4_3]

Su B L and Lu S J. 2017. Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recognition, 63: 397-405[DOI:10.1016/j.patcog.2016.10.016]

Sun Y P, Liu J M, Liu W, Han J Y, Ding E R and Liu J T. 2019. Chinese street view text: large-scale Chinese text reading with partially supervised learning//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9086-9095[DOI: 10.1109/ICCV.2019.00918]

Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9[DOI: 10.1109/cvpr.2015.7298594]

Tang J, Yang Z B, Wang Y P, Zheng Q, Xu Y C and Bai X. 2019. Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern Recognition, 96: #106954[DOI:10.1016/j.patcog.2019.06.020]

Tian S X, Lu S J and Li C S. 2017. WeText: scene text detection under weak supervision//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1492-1500[DOI: 10.1109/ICCV.2017.166]

Tian Z, Huang W L, He T, He P and Qiao Y. 2016. Detecting text in natural image with connectionist text proposal network//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 56-72[DOI: 10.1007/978-3-319-46484-8_4]

Tian Z T, Shu M, Lyu P, Li R Y, Zhou C, Shen X Y and Jia J Y. 2019. Learning shape-aware embedding for scene text detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4229-4238[DOI: 10.1109/CVPR.2019.00436]

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM: 5998-6008

Wan Z Y, Xie F M, Liu Y B, Bai X and Yao C. 2019. 2D-CTC for scene text recognition[EB/OL]. [2021-01-21]. https://arxiv.org/pdf/1907.09705.pdf

Wan Z, He M, Chen H, Bai X and Yao C. 2020. Textscanner: reading characters in order for robust scene text recognition//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI

Wang C, Yin F and Liu C L. 2018a. Memory-augmented attention model for scene text recognition//Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition. Niagara Falls, USA: IEEE: 62-67[DOI: 10.1109/ICFHR-2018.2018.00020]

Wang F F, Zhao L M, Li X, Wang X C and Tao D C. 2018b. Geometry-aware scene text detection with instance transformation network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1381-1389[DOI: 10.1109/CVPR.2018.00150]

Wang H, Lu P, Zhang H, Yang M K, Bai X, Xu Y C, He M C, Wang Y P and Liu W Y. 2020a. All you need is boundary: Toward arbitrary-shaped text spotting//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI: 12160-12167

Wang J F and Hu X L. 2017. Gated recurrent convolution neural network for OCR//Proceedings of 2017 Annual Conference on Neural Information Processing Systems. Long Beach, USA: [s. n.]: 335-344

Wang K, Babenko B and Belongie S. 2011. End-to-end scene text recognition//Proceedings of 2011 IEEE International Conference on Computer Vision. Barcelona, Spain: IEEE: 1457-1464[DOI: 10.1109/ICCV.2011.6126402]

Wang P, Yang L, Li H, Deng Y Y, Shen C H and Zhang Y N. 2019b. A simple and robust convolutional-attention network for irregular text recognition[EB/OL]. [2021-01-21]. https://deepai.org/publication/a-simple-and-robust-convolutional-attention-network-for-irregular-text-recognition

Wang P F, Zhang C G, Qi F, Huang Z M, En M Y, Han J Y, Liu J T, Ding E R and Shi G M. 2019a. A single-shot arbitrarily-shaped text detector based on context attended multi-task learning//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 1277-1285[DOI: 10.1145/3343031.3350988]

Wang Q, Liu S T, Chanussot J and Li X L. 2019d. Scene classification with recurrent attention of VHR remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 57(2): 1155-1167[DOI:10.1109/TGRS.2018.2864987]

Wang Q Q, Jia W J, He X J, Lu Y, Blumenstein M, Huang Y and Lyu S. 2019c. ReELFA: a scene text recognizer with encoded location and focused attention//Proceedings of 2019 International Conference on Document Analysis and Recognition Workshops. Sydney, Australia: IEEE: 71-76[DOI: 10.1109/ICDARW.2019.40084]

Wang S W, Wang Y T, Qin X R, Zhao Q J and Tang Z. 2019e. Scene text recognition via gated cascade attention//Proceedings of 2019 IEEE International Conference on Multimedia and Expo. Shanghai, China: IEEE: 1018-1023[DOI: 10.1109/ICME.2019.00179]

Wang T, Wu D J, Coates A and Ng A Y. 2012. End-to-end text recognition with convolutional neural networks//Proceedings of the 21st International Conference on Pattern Recognition. Tsukuba, Japan: IEEE: 3304-3308

Wang T W, Zhu Y Z, Jin L W, Luo C J, Chen X X, Wu Y Q, Wang Q Y and Cai M X. 2020b. Decoupled attention network for text recognition//Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, the 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020. New York, USA: AAAI

Wang W H, Xie E Z, Li X, Hou W B, Lu T, Yu G and Shao S. 2019f. Shape robust text detection with progressive scale expansion network//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9336-9345[DOI: 10.1109/CVPR.2019.00956]

Wang W H, Xie E Z, Song X G, Zang Y H, Wang W J, Lu T, Yu G and Shen C H. 2019g. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8440-8449[DOI: 10.1109/ICCV.2019.00853]

Wang W J, Xie E Z, Liu X B, Wang W H, Liang D, Shen C H and Bai X. 2020c. Scene text image super-resolution in the wild//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 650-666[DOI: 10.1007/978-3-030-58607-2_38]

Wang X B, Jiang Y Y, Luo Z B, Liu C L, Choi H and Kim S. 2019h. Arbitrary shape scene text detection with adaptive text region representation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6449-6458[DOI: 10.1109/CVPR.2019.00661]

Wang Y X, Xie H T, Zha Z J, Xing M T, Fu Z L and Zhang Y D. 2020d. ContourNet: taking a further step toward accurate arbitrary-shaped scene text detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11753-11762[DOI: 10.1109/CVPR42600.2020.01177]

Wu L, Zhang C Q, Liu J M, Han J Y, Liu J T, Ding E R and Bai X. 2019. Editing text in the wild//Proceedings of ACM International Conference on Multimedia. Nice, France: ACM: 1500-1508[DOI: 10.1145/3343031.3350929]

Wu Y and Natarajan P. 2017. Self-organized text detection with minimal post-processing via border learning//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5000-5009[DOI: 10.1109/ICCV.2017.535]

Xiao S Y, Peng L R, Yan R J, An K Y, Yao G and Min J. 2020. Sequential deformation for accurate scene text detection//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 108-124[DOI: 10.1007/978-3-030-58526-6_7]

Xie E Z, Zang Y H, Shao S, Yu G, Yao C and Li G Y. 2019a. Scene text detection with supervised pyramid context network//Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, the 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. Honolulu, USA: AAAI: 9038-9045

Xie H T, Fang S C, Zha Z J, Yang Y T, Li Y and Zhang Y D. 2019b. Convolutional attention networks for scene text recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1S): #3[DOI:10.1145/3231737]

Xie Z C, Huang Y X, Zhu Y Z, Jin L W, Liu Y L and Xie L L. 2019c. Aggregation cross-entropy for sequence recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 6531-6540[DOI: 10.1109/CVPR.2019.00670]

Xing L J, Tian Z, Huang W L and Scott M. 2019. Convolutional character networks//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9126-9136[DOI: 10.1109/ICCV.2019.00922]

Xu Y C, Wang Y K, Zhou W, Wang Y P, Yang Z B and Bai X. 2019. Textfield: learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11): 5566-5579[DOI:10.1109/TIP.2019.2900589]

Xue C H, Lu S J and Zhan F N. 2018. Accurate scene text detection through border semantics awareness and bootstrapping//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 355-372[DOI: 10.1007/978-3-030-01270-0_22]

Xue C H, Lu S J and Zhang W. 2019. MSR: multi-scale shape regression for scene text detection//Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: [s. n.]: 989-995

Yang M K, Guan Y S, Liao M H, He X, Bian K G, Bai S, Yao C and Bai X. 2019. Symmetry-constrained rectification network for scene text recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9147-9156[DOI: 10.1109/ICCV.2019.00924]

Yang Q P, Cheng M L, Zhou W M, Chen Y, Qiu M H and Lin W. 2018. Inceptext: a new inception-text module with deformable psroi pooling for multi-oriented scene text detection//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: ACM: 1071-1077

Yang X, He D F, Zhou Z H, Kifer D and Giles C L. 2017. Learning to read irregular text with attention mechanisms//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: ACM: 3280-3286

Yao C, Bai X and Liu W Y. 2014a. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11): 4737-4749[DOI:10.1109/TIP.2014.2353813]

Yao C, Bai X, Liu W Y, Ma Y and Tu Z W. 2012. Detecting texts of arbitrary orientations in natural images//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 1083-1090[DOI: 10.1109/CVPR.2012.6247787]

Yao C, Bai X, Sang N, Zhou X Y, Zhou S C and Cao Z M. 2016. Scene text detection via holistic, multi-channel prediction[EB/OL]. [2021-01-21]. https://arxiv.org/pdf/1606.09002.pdf

Yao C, Bai X, Shi B G and Liu W Y. 2014b. Strokelets: a learned multi-scale representation for scene text recognition//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 4042-4049[DOI: 10.1109/CVPR.2014.515]

Yin F, Wu Y C, Zhang X Y and Liu C L. 2017. Scene text recognition with sliding convolutional character models[EB/OL]. [2021-01-06]. https://arxiv.org/pdf/1709.07727.pdf

Yu D L, Li X, Zhang C Q, Liu T, Han J Y, Liu J T and Ding E R. 2020. Towards accurate scene text recognition with semantic reasoning networks//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 12110-12119[DOI: 10.1109/CVPR42600.2020.01213]

Yue X Y, Kuang Z H, Lin C H, Sun H B and Zhang W. 2020. RobustScanner: dynamically enhancing positional clues for robust text recognition//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 135-151[DOI: 10.1007/978-3-030-58529-7_9]

Zhan F N and Lu S J. 2019. ESIR: end-to-end scene text recognition via iterative image rectification//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2059-2068[DOI: 10.1109/CVPR.2019.00216]

Zhan F N, Lu S J and Xue C H. 2018. Verisimilar image synthesis for accurate detection and recognition of texts in scenes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 249-266[DOI: 10.1007/978-3-030-01237-3_16]

Zhan F N, Zhu H Y and Lu S J. 2019. Spatial fusion GAN for image synthesis//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3653-3662[DOI: 10.1109/CVPR.2019.00377]

Zhang C H, Gupta A and Zisserman A. 2020a. Adaptive text recognition through visual matching//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 51-67[DOI: 10.1007/978-3-030-58517-4_4]

Zhang C Q, Liang B R, Huang Z M, En M Y, Han J Y, Ding E R and Ding X H. 2019a. Look more than once: an accurate detector for text of arbitrary shapes//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 10552-10561[DOI: 10.1109/CVPR.2019.01080]

Zhang H, Yao Q M, Yang M K, Xu Y C and Bai X. 2020b. AutoSTR: efficient backbone search for scene text recognition//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 751-767[DOI: 10.1007/978-3-030-58586-0_44]

Zhang S X, Zhu X B, Hou J B, Liu C, Yang C, Wang H F and Yin X C. 2020c. Deep relational reasoning graph network for arbitrary shape text detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9699-9708[DOI: 10.1109/CVPR42600.2020.00972]

Zhang Y P, Nie S, Liu W J, Xu X, Zhang D X and Shen H T. 2019b. Sequence-to-sequence domain adaptation network for robust text image recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2735-2744[DOI: 10.1109/CVPR.2019.00285]

Zhang Z, Zhang C Q, Shen W, Yao C, Liu W Y and Bai X. 2016. Multi-oriented text detection with fully convolutional networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4159-4167[DOI: 10.1109/CVPR.2016.451]

Zhong Y, Karu K and Jain A K. 1995. Locating text in complex color images. Pattern Recognition, 28(10): 1523-1535[DOI:10.1016/0031-3203(95)00030-4]

Zhong Z Y, Jin L W and Huang S P. 2017. DeepText: a new approach for text proposal generation and text detection in natural images//Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, USA: IEEE: 1208-1212[DOI: 10.1109/ICASSP.2017.7952348]

Zhong Z Y, Sun L and Huo Q. 2019. An anchor-free region proposal network for Faster R-CNN-based text detection approaches. International Journal on Document Analysis and Recognition, 22(3): 315-327[DOI:10.1007/s10032-019-00335-y]

Zhou X Y, Yao C, Wen H, Wang Y Z, Zhou S C, He W R and Liang J J. 2017. EAST: an efficient and accurate scene text detector//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5551-5560[DOI: 10.1109/CVPR.2017.283]

Zhu Y W, Wang S L, Huang Z and Chen K. 2019. Text recognition in images based on transformer with hierarchical attention//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, China: IEEE: 1945-1949[DOI: 10.1109/ICIP.2019.8803203]

Zhu Y X and Du J. 2021. TextMountain: accurate scene text detection via instance segmentation. Pattern Recognition, 110: #107336[DOI:10.1016/j.patcog.2020.107336]

Zitnick C L and Dollar P. 2014. Edge boxes: locating object proposals from edges//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 391-405[DOI: 10.1007/978-3-319-10602-1_26]
