Text-centric image analysis techniques: a critical review
2023, Vol. 28, No. 8: 2253-2275
Print publication date: 2023-08-16
DOI: 10.11834/jig.220968
Zhang Yan, Li Qiang, Shen Huawen, Zeng Gangyan, Zhou Yu, Ma Can, Zhang Yuan and Wang Weiping. 2023. Text-centric image analysis techniques: a critical review. Journal of Image and Graphics, 28(08): 2253-2275 [DOI: 10.11834/jig.220968]
Text is widely present in document images and natural scene images and carries rich, critical semantic information. With the development of deep learning, researchers are no longer satisfied with merely obtaining the textual content of an image; they increasingly focus on understanding that text, so text-centric image understanding techniques have attracted growing attention. These techniques aim to understand text images fully by exploiting multimodal information such as text and visual objects. As an interdisciplinary direction between computer vision and natural language processing, they are of great practical significance. This paper surveys representative text-centric image understanding tasks and, according to the level of understanding and cognition required, divides them into two categories: the first only requires a model to extract information, while the second additionally requires a certain capacity for analysis and reasoning. We review the datasets, evaluation metrics, and classical methods involved in these tasks, compare and analyze them, and point out open problems and future trends, in the hope of providing a reference for subsequent research.
Text is one of the key carriers of information and is now widespread in digital media, appearing in both document images and natural scene images. To extract and analyze the information carried by such images automatically, conventional research has mainly focused on text extraction techniques such as scene text detection and recognition. However, recognizing and analyzing the semantics of text-centric images, a downstream task of text spotting, remains challenging because it is difficult to fully leverage multimodal features from both vision and language. To this end, text-centric image understanding has become an emerging research topic, and many related tasks have been proposed. For example, visual information extraction is capable of extracting specified content from a given image, which can improve productivity in finance, social media, and other fields. In this paper, we introduce five representative text-centric image understanding tasks and conduct a systematic survey of them. According to the level of understanding required, these tasks can be broadly classified into two categories. The first category requires only the basic ability to extract and distinguish information, as in visual information extraction and scene text retrieval. In contrast, besides this fundamental ability, the second category is more concerned with high-level semantic capabilities such as information aggregation and logical reasoning. With the progress of deep learning and multimodal learning, the second category has attracted considerable attention recently; for it, this survey mainly introduces document visual question answering, scene text visual question answering, and scene text image captioning. Over the past few decades, the development of text-centric image understanding techniques has gone through several stages. Earlier approaches were based on heuristic rules and often used only unimodal features; deep learning methods now dominate the area, and multimodal features are valued and exploited to improve performance. More specifically, traditional visual information extraction depends on pre-defined templates or task-specific rules. Traditional scene text retrieval tends to represent words with pyramidal histogram of characters (PHOC) vectors and to predict the matched image according to distances between these representations. Expanding the conventional visual question answering framework, earlier document visual question answering and scene text visual question answering approaches simply add an optical character recognition (OCR) branch to extract text information. Because integrating knowledge from multimodal signals helps models understand images better, graph neural networks and Transformer-based frameworks have recently been used to fuse multimodal features, and self-supervised pre-training schemes are applied to learn the alignment between modalities, boosting model capability by a large margin (minimal code sketches of these representative techniques are given after the abstract). For each text-centric image understanding task, we summarize classical methods and elaborate on their pros and cons. In addition, we discuss potential problems and further research directions for the community.
Firstly, due to the complexity of the different modalities, with their variable layouts and diverse fonts, current deep learning architectures still struggle to fuse multimodal information efficiently. Secondly, existing text-centric image understanding methods remain limited in reasoning abilities that involve counting, sorting, and arithmetic operations; for instance, in document visual question answering and scene text visual question answering, current models have difficulty predicting accurate answers when doing so requires jointly reasoning over image layout, textual content, and visual elements. Finally, current text-centric understanding tasks are usually trained independently, so the correlation between different tasks has not been effectively leveraged. We hope this survey can help researchers capture the latest progress in text-centric image understanding and inspire the design of advanced models and algorithms.
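To make the traditional retrieval pipeline mentioned in the abstract concrete, below is a minimal sketch of PHOC-style matching: each word is encoded as a binary pyramidal histogram of characters, and images are ranked by the distance between the query vector and the vectors of the words spotted in each image. The center-based region assignment, pyramid levels, and alphabet are simplifying assumptions for illustration, not the exact formulation of Almazán et al. (2014).

```python
# A simplified PHOC (pyramidal histogram of characters) sketch for
# distance-based scene text retrieval. Illustrative only.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(1, 2, 3)):
    """Encode a word as a binary vector: at each pyramid level the word is
    split into equal regions, and each region records which characters
    fall into it (assigned here by the character's center position)."""
    word = word.lower()
    n = max(len(word), 1)
    feats = []
    for level in levels:
        hist = np.zeros((level, len(ALPHABET)))
        for i, ch in enumerate(word):
            idx = ALPHABET.find(ch)
            if idx < 0:
                continue  # ignore characters outside the alphabet
            center = (i + 0.5) / n                      # normalized position
            region = min(int(center * level), level - 1)
            hist[region, idx] = 1.0
        feats.append(hist.ravel())
    return np.concatenate(feats)

def rank_images(query, image_word_vecs):
    """Rank images by the smallest distance between the query PHOC and
    the PHOCs of the words spotted in each image (smaller is better)."""
    q = phoc(query)
    scored = [(img, min(np.linalg.norm(q - v) for v in vecs))
              for img, vecs in image_word_vecs.items()]
    return sorted(scored, key=lambda s: s[1])

# Hypothetical usage: in a real system the per-image word vectors would
# come from a text detector/recognizer rather than being hand-listed.
db = {"img1": [phoc("coffee"), phoc("shop")], "img2": [phoc("exit")]}
print(rank_images("coffee", db))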
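The shift toward Transformer-based fusion can be sketched just as compactly. The toy PyTorch module below, written in the spirit of pointer-augmented multimodal Transformers such as M4C (Hu et al., 2020), projects question tokens, detected visual objects, and OCR tokens into a shared space, adds layout embeddings derived from bounding boxes, and lets a Transformer encoder attend across all three modalities; the fused OCR-token states can then be scored by a pointer head as answer candidates. All dimensions and module choices are illustrative assumptions, not a published configuration.

```python
# A minimal sketch of Transformer-based multimodal fusion for
# text-centric VQA; the sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, vocab_size=30000, d_model=768, n_heads=12,
                 n_layers=4, d_visual=2048, d_ocr=300):
        super().__init__()
        self.question_emb = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)  # detected objects
        self.ocr_proj = nn.Linear(d_ocr, d_model)        # OCR token features
        self.bbox_proj = nn.Linear(4, d_model)           # (x1, y1, x2, y2)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, q_tokens, vis_feats, vis_boxes, ocr_feats, ocr_boxes):
        # Project every modality into the shared d_model space; boxes give
        # each visual/OCR token a layout-aware position signal.
        q = self.question_emb(q_tokens)
        v = self.visual_proj(vis_feats) + self.bbox_proj(vis_boxes)
        t = self.ocr_proj(ocr_feats) + self.bbox_proj(ocr_boxes)
        fused = self.encoder(torch.cat([q, v, t], dim=1))
        # Return the fused OCR-token states; a pointer head can score them
        # against a decoder state to copy an OCR token as the answer.
        return fused[:, q.size(1) + v.size(1):, :]
```

Graph neural networks play an analogous role when the relations between text instances and visual objects are modeled explicitly as edges rather than through full self-attention.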
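Finally, the self-supervised pre-training noted in the abstract frequently takes the form of layout-aware masked language modeling, as in LayoutLM-style document pre-training: text tokens are randomly masked while their 2D position embeddings are kept intact, so reconstructing the tokens forces the model to align textual content with page layout. The quantized coordinate grid and all sizes below are illustrative assumptions.

```python
# A minimal sketch of layout-aware masked language modeling for document
# pre-training; grid size, depths, and ids are illustrative assumptions.
import torch
import torch.nn as nn

class LayoutAwareMLM(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, grid=1000,
                 n_heads=12, n_layers=6, mask_id=103):
        super().__init__()
        self.mask_id = mask_id
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.x_emb = nn.Embedding(grid, d_model)   # quantized x coordinates
        self.y_emb = nn.Embedding(grid, d_model)   # quantized y coordinates
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, boxes, mask_prob=0.15):
        # token_ids: (B, L) word-piece ids; boxes: (B, L, 4) integer
        # (x1, y1, x2, y2) coordinates on a 0..grid-1 page grid.
        labels = token_ids.clone()
        masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
        inputs = token_ids.masked_fill(masked, self.mask_id)
        # Layout embeddings stay visible even where the text is masked,
        # which is what teaches the model to tie content to position.
        h = (self.tok_emb(inputs)
             + self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
             + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        logits = self.head(self.encoder(h))
        return nn.functional.cross_entropy(logits[masked], labels[masked])
```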
text image understanding; visual information extraction; scene text retrieval; document visual question answering; scene text visual question answering; scene text image captioning