OCR in the Era of Large Models: Current Status and Prospects
2025, pp. 1-28
Received: 2025-03-09; Revised: 2025-03-18; Accepted: 2025-04-09; Published online: 2025-04-09
DOI: 10.11834/jig.250098
This paper reviews recent technical advances in optical character recognition (OCR) and multimodal learning, focusing on the applications and frontier progress of large OCR models in multimodal learning and multi-task unified modeling. With the development of deep learning, OCR has gradually shifted from traditional methods to end-to-end models based on deep neural networks, giving rise to a large number of large OCR models with high accuracy and strong generalization ability. Multimodal large models fuse visual, linguistic, and other perceptual channels to improve understanding and generation in complex scenarios, while multi-task unified large models simplify model design through a general architecture and improve processing efficiency across multiple OCR tasks. The paper further analyzes the current status and challenges of OCR-enhanced multimodal large models, multimodal large models for document understanding, and multimodal large models for specific OCR tasks; it discusses the technical bottlenecks and future directions of large OCR models, including improving high-resolution processing, visual token compression, and the perception and understanding of structured graphical symbols and complex layout structures; and it surveys their broad application potential in document digitization, automated software testing, intelligent education, and beyond.
With the rapid advancement of artificial intelligence, the emergence of large language models (LLMs) and multimodal large language models (MLLMs) has profoundly impacted optical character recognition (OCR), bringing about a paradigm shift in traditional OCR methods. This paper systematically reviews recent developments in OCR and multimodal learning, emphasizing the latest applications and advances of large OCR models within multimodal and multi-task unified modeling. First, the paper defines the scope of large OCR models, categorizing them into OCR multimodal large language models (OCR-MLLMs) and Omni-OCR models. OCR-MLLMs build on pre-trained LLMs and employ supervised fine-tuning (SFT) datasets, such as QA-style tasks, to learn from vast multimodal data across various OCR scenarios, yielding specialized multimodal OCR models that handle diverse recognition and comprehension tasks. Conversely, Omni-OCR models unify multiple tasks within a general architecture, using large-scale parameterization to learn generalized OCR capabilities from extensive multi-task datasets. Specifically, this review covers four key aspects.

1) OCR-enhancing MLLMs. Early models such as LLaVA and MiniGPT-4 exhibited basic OCR capabilities but lagged significantly behind specialized OCR systems. Researchers have improved OCR performance by introducing specialized OCR datasets, as exemplified by Qwen-VL and LLaVA-1.5. Another critical direction is enhancing models' ability to process high-resolution images. Approaches such as Monkey and InternLM-XComposer2-4KHD implement sub-image cropping strategies, whereas Qwen2-VL adopts a ViT architecture with two-dimensional rotary position embedding, enabling image encoding at arbitrary resolutions while mitigating the semantic fragmentation caused by cropping.

2) MLLMs for document understanding. These can be categorized into OCR-free and OCR-dependent approaches. OCR-free methods eliminate traditional OCR preprocessing and process document images directly for end-to-end understanding. Notable breakthroughs include generating synthetic dialogue training data via LLMs or MLLMs; for instance, TextSquare combines OCR annotations with images to construct extensive cross-domain visual question answering (VQA) datasets. Advances have also come from cross-dataset integration and training-task design: mPLUG-DocOwl consolidates diverse instruction datasets (e.g., documents, tables, charts, webpages, and natural images); Fox develops datasets for region-based OCR, translation, summarization, layout analysis, and dialogue; and DOGE constructs multi-granularity document parsing datasets, where full-page parsing tasks enhance models' comprehensive perception of document content. Furthermore, visual encoding architectures tailored to document characteristics have been introduced, such as UReader's shape-adaptive cropping module, TextMonkey's shifted window attention with token resampling, and Vary's dual vision vocabularies (a toy sketch of the shape-adaptive cropping idea follows below).
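As an illustration of the sub-image cropping strategies above, the following is a minimal sketch in the spirit of UReader's shape-adaptive cropping and Monkey's tiling: pick the tile grid whose aspect ratio is closest to the image's, resize, split into ViT-sized crops, and keep a global thumbnail against cropping-induced fragmentation. The tile size, grid candidates, and selection rule are simplifying assumptions, not the exact modules of those papers.

```python
# A toy sketch of shape-adaptive sub-image cropping for high-resolution
# document images. TILE and MAX_TILES are illustrative assumptions.
from PIL import Image

TILE = 448          # assumed ViT input resolution
MAX_TILES = 9       # assumed compute budget per image

def candidate_grids(max_tiles):
    # All (rows, cols) grids within the tile budget.
    return [(r, c) for r in range(1, max_tiles + 1)
                   for c in range(1, max_tiles + 1) if r * c <= max_tiles]

def best_grid(w, h, max_tiles=MAX_TILES):
    # Pick the grid whose cols/rows ratio best matches the image's w/h.
    target = w / h
    return min(candidate_grids(max_tiles),
               key=lambda rc: abs(rc[1] / rc[0] - target))

def crop_to_tiles(img: Image.Image):
    rows, cols = best_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    thumbnail = img.resize((TILE, TILE))  # low-resolution global view
    return tiles, thumbnail

# Usage: tiles, thumb = crop_to_tiles(Image.open("page.png"))
```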
OCR-dependent approaches enhance accuracy by integrating OCR outputs into the model architecture. Examples include LayoutLLM, which incorporates LayoutLMv3's OCR features; DocLLM, which embeds layout information into the attention mechanism (a schematic sketch follows this overview); and DocLayLLM, which achieves an efficient multimodal extension of LLMs by inserting two-dimensional positional tokens and leveraging chain-of-thought pre-training with annealing techniques.

3) Specialized multimodal large models for OCR tasks. These models focus on specific OCR-related tasks, including chart analysis, table parsing, and multi-page document understanding, addressing the limitations of general-purpose models. In chart understanding, MMCA and ChartLlama utilize GPT-4-generated instruction data, ChartAssistant-S adopts chart-to-table pre-training to enhance visual representation, TinyChart employs code generation and visual token aggregation strategies, and ChartMoE mitigates catastrophic forgetting with a mixture-of-experts (MoE) architecture. For table parsing, Table-LLaVA curates extensive pre-training and fine-tuning datasets, significantly improving performance, while TabPedia integrates dual-resolution encoders and meditative tokens for adaptive multimodal feature fusion; other models specialize in scientific table parsing through a two-stage fine-tuning approach. In document retrieval and multi-page understanding, advances fall into two primary categories. The first is retrieval-augmented architecture design: CREAM introduces a hierarchical retrieval strategy that refines results progressively from coarse to fine granularity, while PDF-WuKong develops an end-to-end sparse sampling method that filters relevant content directly within the model. The second is multimodal indexing paradigms: DSE and ColPali leverage multimodal large models to encode document page images and queries directly, eliminating the need for OCR-based document parsing; this preserves layout and visual information and avoids OCR-induced errors (a late-interaction scoring sketch likewise follows below).

4) Omni-OCR models. Prior to the emergence of LLMs, Omni-OCR models developed along four primary architectural lines. Document understanding pre-trained models (e.g., the LayoutLM series) integrated textual, spatial, and visual features to support tasks such as key information extraction and question answering. Pix2Seq-style models (e.g., Donut, UDOP) performed multi-task processing without relying on OCR information, operating directly on image inputs combined with prompt mechanisms; OmniParser jointly modeled detection, recognition, and information extraction through a staged decoding process. Document parsing models (e.g., Nougat, KOSMOS-2.5, GOT) transformed documents into structured formats such as HTML or Markdown. Additionally, pixel-level unified models (e.g., DocRes, UPOCR) employed encoder-decoder architectures with task prompts to facilitate multi-task training and knowledge transfer across pixel-level tasks.

Despite significant advances, current large OCR models still face key challenges: performance gaps relative to specialized traditional models in complex layout parsing, handwriting recognition, and historical manuscript analysis; inefficiencies related to high-resolution inputs, inference latency, and parameter redundancy, which hinder practical deployment; and insufficient fine-grained perception and logical reasoning in complex scenarios.
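The layout-into-attention idea behind DocLLM can be pictured as disentangled attention that mixes separate text and bounding-box projections into one score map. The sketch below follows that scheme only schematically: the box embedding, dimensions, and mixing weights `lam` are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def spatial_attention(x_text, x_box, Wq_t, Wk_t, Wq_s, Wk_s,
                      lam=(1.0, 1.0, 1.0)):
    """Disentangled text/layout attention, schematically:
    A = Qt Kt^T + l1*Qt Ks^T + l2*Qs Kt^T + l3*Qs Ks^T."""
    Qt, Kt = x_text @ Wq_t, x_text @ Wk_t   # text projections
    Qs, Ks = x_box @ Wq_s, x_box @ Wk_s     # layout projections
    d = Qt.shape[-1] ** 0.5
    scores = (Qt @ Kt.transpose(-1, -2)
              + lam[0] * Qt @ Ks.transpose(-1, -2)
              + lam[1] * Qs @ Kt.transpose(-1, -2)
              + lam[2] * Qs @ Ks.transpose(-1, -2)) / d
    return torch.softmax(scores, dim=-1)

# Toy usage: 5 tokens, model dim 16; x_box is an assumed embedding of
# (x0, y0, x1, y1) word boxes, not a real OCR output.
T, D = 5, 16
x_text, x_box = torch.randn(T, D), torch.randn(T, D)
Ws = [torch.randn(D, D) / D ** 0.5 for _ in range(4)]
attn = spatial_attention(x_text, x_box, *Ws)   # (5, 5) attention map
```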
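The multimodal indexing paradigm of DSE and ColPali reduces at query time to matching query-token embeddings against page-patch embeddings. Below is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring over such embeddings; the shapes and the toy random "index" are illustrative.

```python
import numpy as np

def maxsim_score(query_emb, page_emb):
    """Late interaction: each query token takes its best match among the
    page's patch embeddings; scores are summed. Inputs are assumed
    L2-normalized, with shapes (n_query, d) and (n_patches, d)."""
    sim = query_emb @ page_emb.T      # (n_query, n_patches) cosine sims
    return sim.max(axis=1).sum()      # MaxSim

# Toy index of 3 pages, each with 196 patch vectors of dim 128.
rng = np.random.default_rng(0)
pages = [rng.standard_normal((196, 128)) for _ in range(3)]
pages = [p / np.linalg.norm(p, axis=1, keepdims=True) for p in pages]
q = rng.standard_normal((12, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
best_page = max(range(len(pages)), key=lambda i: maxsim_score(q, pages[i]))
```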
Future developments should focus on four key areas: fine-grained feature extraction and document structure modeling for deeper comprehension; large-scale self-supervised pre-training to improve generalization; chain-of-thought reasoning for multimodal logical analysis and emerging tasks such as mathematical reasoning and multilingual processing; and model lightweighting via visual token compression and dynamic computation allocation to balance accuracy and efficiency (a toy token-merging sketch is given below). In practical applications, large OCR models exhibit significant potential in intelligent document processing, automated testing, digital education, historical document restoration, and oracle bone language decipherment. Since the early 21st century, OCR technology has undergone a paradigm shift from character recognition to semantic understanding. With the emergence of MLLMs, OCR systems are evolving from "text transcription tools" into "intelligent document understanding platforms," continuously extending their capabilities to cross-modal reasoning, dynamic interaction, and intelligent decision support. As a foundational AI technology, large OCR models are expected to play a pivotal role in driving digital transformation across industries, accelerating the intelligent evolution of document processing, text-image understanding, cultural heritage preservation, finance, and education, and providing robust technical support for knowledge management and innovation across society.
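For the visual token compression direction, the following is a minimal sketch of similarity-based token merging; the norm-based keep rule and merge-by-averaging are illustrative assumptions rather than a specific published method (TinyChart's token merging, TextMonkey's token resampling, and DocKylin's visual slimming each use their own variants).

```python
import numpy as np

def merge_visual_tokens(tokens, keep_ratio=0.5):
    """Compress a (n, d) visual token sequence: keep the highest-norm
    tokens (an assumed saliency proxy) and average each dropped token
    into its most cosine-similar kept token."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-np.linalg.norm(tokens, axis=1))  # descending norm
    keep, drop = order[:k], order[k:]
    merged = tokens[keep].copy()
    counts = np.ones(k)
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    for i in drop:
        j = int(np.argmax(normed[i] @ normed[keep].T))  # nearest kept token
        merged[j] = (merged[j] * counts[j] + tokens[i]) / (counts[j] + 1)
        counts[j] += 1
    return merged  # (k, d): a shorter sequence for the LLM decoder

# 1024 patch tokens of dim 1024 -> 512 merged tokens
feats = np.random.default_rng(1).standard_normal((1024, 1024))
compressed = merge_visual_tokens(feats)
```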
Appalaraju S, Jasani B, Kota B U, Xie Y S and Manmatha R. 2021. DocFormer: end-to-end transformer for document understanding//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 993-1003
Appalaraju S, Tang P, Dong Q, Sankaran N, Zhou Y C and Manmatha R. 2024. DocFormerv2: local features for document understanding//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 709-718 [DOI: 10.1609/aaai.v38i2.27828]
Anthropic Team. 2024. Claude 3 Haiku: our fastest model yet [EB/OL]. [2024-09-03]. https://www.anthropic.com/news/claude-3-haiku
Agrawal P, Antoniak S, Hanna E B, Bout B, Chaplot D, Chudnovsky J, Costa D, Monicault B D, Garg S, Gervet T, Ghosh S, Héliou A, Jacob P, Jiang A Q, Khandelwal K, Lacroix T, Lample G, Casas D L, Lavril T, Scao T L, Lo A, Marshall W, Martin L, Mensch A, Muddireddy P, Nemychnikova V, Pellat M, Platen P V, Raghuraman N, Rozière B, Sablayrolles A, Saulnier L, Sauvestre R, Shang W, Soletskyi R, Stewart L, Stock P, Studnia J, Subramanian S, Vaze S, Wang T and Yang S. 2024. Pixtral 12B [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2410.07073
Bai J Z, Bai S, Yang S S, Wang S J, Tan S N, Wang P, Lin J Y, Zhou C and Zhou J R. 2023. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond [EB/OL]. [2023-10-13]. https://arxiv.org/pdf/2308.12966
Beyer L, Steiner A, Pinto A S, Kolesnikov A, Wang X, Salz D, Neumann M, Alabdulmohsin I, Tschannen M, Bugliarello E, Unterthiner T, Keysers D, Koppula S, Liu F, Grycner A, Gritsenko A, Houlsby N, Kumar M, Rong K, Eisenschlos J, Kabra R, Bauer M, Bošnjak M, Chen X, Minderer M, Voigtlaender P, Bica I, Balazevic I, Puigcerver J, Papalampidi P, Henaff O, Xiong X, Soricut R, Harmsen J and Zhai X. 2024. PaliGemma: a versatile 3B VLM for transfer [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2407.07726
Blecher L, Cucurull G, Scialom T and Stojnic R. 2024. Nougat: neural optical understanding for academic documents//Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria
Biten A F, Tito R, Mafla A, Gomez L, Rusinol M, Valveny E, Jawahar C V and Karatzas D. 2019. Scene text visual question answering//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4291-4301 [DOI: 10.1109/ICCV.2019.00439]
Borchmann Ł, Pietruszka M, Stanislawek T, Jurkiewicz D, Turski M, Szyndler K and Graliński F. 2021. DUE: end-to-end document understanding benchmark//Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 2021. Online: NeurIPS
Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I and Amodei D. 2020. Language models are few-shot learners//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 1877-1901 [DOI: 10.5555/3495724.3495883]
Chen W H, Wang H M, Chen J S, Zhang Y K, Wang H, Li S Y, Zhou X Y and Wang W Y. 2020. TabFact: a large-scale dataset for table-based fact verification//Proceedings of the Eighth International Conference on Learning Representations. Virtual: [s. n.]
Chen Z, Wang W Y, Wang W H, Cui E F, Gao Z W, Zhu X Z, Lu L W, Lu T, Qiao Y and Dai J F. 2024a. InternVL 1.2: scaling up LLM to 34B [EB/OL]. [2024-09-03]. https://internvl.github.io/blog/2024-02-21-InternVL-1.2
Chen Z, Wang W Y, Tian H, Ye S L, Gao Z W, Cui E F, Tong W W, Hu K Z, Luo J P, Ma Z, Ma J, Wang J Q, Dong X Y, Yan H, Guo H W, He C H, Shi B T, Jin Z J, Xu C, Wang B, Wei X J, Li W, Zhang W J, Zhang B, Cai P L, Wen L C, Yan X C, Dou M, Lu L W, Zhu X Z, Lu T, Lin D H, Qiao Y, Dai J F and Wang W H. 2024b. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12): 220101 [DOI: 10.1007/s11432-024-4231-5]
Chen Z, Wang W Y, Cao Y, Liu Y Z, Gao Z W, Cui E F, Zhu J G, Ye S L, Tian H, Liu Z Y, Gu L X, Wang X H, Li Q Y, Ren Y M, Chen Z X, Luo J P, Wang J H, Jiang T, Wang B, He C H, Shi B T, Zhang X C, Lv H, Wang Y, Shao W Q, Chu P, Tu Z Y, He T, Wu Z Y, Deng H P, Ge J Y, Chen K, Zhang K P, Wang L M, Dou M, Lu L W, Zhu X Z, Lu T, Lin D H, Qiao Y, Dai J F and Wang W H. 2024c. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.05271
Chen J, Zhang R Y, Zhou Y F, Yu T, Dernoncourt F, Gu J X, Rossi R A, Chen C Y and Sun T. 2025. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Cho J, Mahata D, Irsoy O, He Y J and Bansal M. 2024. M3DocRAG: multi-modal retrieval is what you need for multi-page multi-document understanding [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2411.04952
Deng C, Yuan J L, Bu P, Wang P J, Li Z Z, Xu J, Li X H, Gao Y, Song J, Zheng B and Liu C L. 2024. LongDocURL: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.18424
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, USA: Association for Computational Linguistics: 4171-4186 [DOI: 10.18653/v1/N19-1423]
Deitke M, Clark C, Lee S, Tripathi R, Yang Y, Park J S, Salehi M, Muennighoff N, Lo K, Soldaini L, Lu J, Anderson T, Bransom E, Ehsani K, Ngo H, Chen Y, Patel A, Yatskar M, Callison-Burch C, Head A, Hendrix R, Bastani F, VanderBilt E, Lambert N, Chou Y, Chheda A, Sparks J, Skjonsberg S, Schmitz M, Sarnat A, Bischoff B, Walsh P, Newell C, Wolters P, Gupta T, Zeng K H, Borchardt J, Groeneveld D, Nam C, Lebrecht S, Wittlif C, Schoenick C, Michel O, Krishna R, Weihs L, Smith N A, Hajishirzi H, Girshick R, Farhadi A and Kembhavi A. 2024. Molmo and PixMo: open weights and open data for state-of-the-art vision-language models [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2409.17146
Dong X Y, Zhang P, Zang Y H, Cao Y H, Wang B, Ouyang L K, Zhang S Y, Duan H D, Zhang W W, Li Y N, Yan H, Gao Y, Chen Z, Zhang X Y, Li W, Li J W, Wang W H, Chen K, He C H, Zhang X C, Dai J F, Qiao Y, Lin D H and Wang J Q. 2024. InternLM-XComposer2-4KHD: a pioneering large vision-language model handling resolutions from 336 pixels to 4K HD//Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 42566-42592
Duan C, Jiang Q Y, Fu P, Chen J M, Li S X, Wang Z N, Guo S and Luo F J. 2025. InstructOCR: instruction boosting scene text spotting//Proceedings of the 39th AAAI Conference on Artificial Intelligence. Philadelphia, USA: AAAI
Faysse M, Sibille H, Wu T, Omrani B, Viaud G, Hudelot C and Colombo P. 2025. ColPali: efficient document retrieval with vision language models//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Fan X R, Ji T, Jiang C H, Li S, Jin S J, Song S R, Wang J K, Hong B Y, Chen L, Zheng G D, Zhang M, Huang C S, Zheng R, Xi Z H, Zhou Y H, Dou S H, Ye J J, Yan H, Gui T, Zhang Q, Qiu X P, Huang X J, Wu Z X and Jiang Y G. 2024. Poly-visual-expert vision-language models//Proceedings of the First Conference on Language Modeling. Philadelphia, USA
Feng H, Wang Z J, Tang J Q, Lu J H, Zhou W G, Li H Q and Huang C. 2023. UniDoc: a universal large multimodal model for simultaneous text detection, recognition, spotting and understanding [EB/OL]. [2023-09-02]. https://arxiv.org/pdf/2308.11592
Feng H, Liu Q, Liu H, Tang J Q, Zhou W G, Li H Q and Huang C. 2024. DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Science China Information Sciences, 67(12): 220106 [DOI: 10.1007/s11432-024-4250-y]
Fujitake M. 2024. LayoutLLM: large language model instruction tuning for visually rich document understanding//Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy: ELRA and ICCL: 10219-10224
Fu L, Yang B, Kuang Z B, Song J J, Li Y Z, Zhu L H, Luo Q D, Wang X Y, Lu H, Huang M X, Li Z, Tang G Z, Shan B, Lin C H, Liu Q, Wu B H, Feng H, Liu H, Huang C, Tang J Q, Chen W, Jin L W, Liu Y L and Bai X. 2024. OCRBench v2: an improved benchmark for evaluating large multimodal models on visual text localization and reasoning [EB/OL]. [2024-12-31]. https://www.arxiv.org/abs/2501.00321
Gao L C, Li Y B, Du L, Zhang X P, Zhu Z Y, Lu N, Jin L W, Huang Y S and Tang Z. 2022. A survey on table recognition technology. Journal of Image and Graphics, 27(6): 1898-1917 [DOI: 10.11834/jig.220152]
Gao P, Han J M, Zhang R R, Lin Z Y, Geng S J, Zhou A J, Zhang W, Lu P, He C H, Yue X Y, Li H S and Qiao Y. 2023. LLaMA-Adapter V2: parameter-efficient visual instruction model [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2304.15010
Gemini Team, Google. 2024. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2403.05530
Guan H S, Yang H X, Wang X Y, Han S W, Liu Y G, Jin L W, Bai X and Liu Y L. 2024. Deciphering oracle bone language with diffusion models//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics: 15554-15567
Guo Z H, Xu R, Yao Y, Cui J B, Ni Z L, Ge C J, Chua T S, Liu Z Y and Huang G. 2024. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images//Proceedings of the 18th European Conference on Computer Vision. Milano, Italy: Springer: 390-406 [DOI: 10.1007/978-3-031-73010-8_23]
Han Y C, Zhang C, Chen X, Yang X, Wang Z B, Yu G, Fu B and Zhang H W. 2023. ChartLlama: a multimodal LLM for chart understanding and generation [EB/OL]. [2023-11-27]. https://arxiv.org/pdf/2311.16483
Hong T, Kim D, Ji M, Hwang W, Nam D and Park S. 2022. BROS: a pre-trained language model focusing on text and layout for better key information extraction from documents//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Online: AAAI: 10767-10775 [DOI: 10.1609/aaai.v36i10.21322]
Hong W Y, Wang W H, Lv Q S, Xu J Z, Yu W M, Ji J H, Wang Y, Wang Z H, Zhang Y X, Li J Z, Xu B, Dong Y X, Ding M and Tang J. 2024a. CogAgent: a visual language model for GUI agents//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 14281-14290
Hong W, Wang W, Ding M, Yu W, Lv Q, Wang Y, Cheng Y, Huang S, Ji J, Xue Z, Zhao L, Yang Z, Gu X, Zhang X, Feng G, Yin D, Wang Z, Qi J, Song X, Zhang P, Liu D, Xu B, Li J, Dong Y and Tang J. 2024b. CogVLM2: visual language models for image and video understanding [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2408.16500
Hu W B, Xu Y F, Li Y, Li W Y, Chen Z Y and Tu Z W. 2024a. BLIVA: a simple multimodal LLM for better handling of text-rich visual questions//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 2256-2264 [DOI: 10.1609/aaai.v38i3.27999]
Hu A W, Xu H Y, Ye J B, Yan M, Zhang L, Zhang B, Li C, Zhang J, Jin Q, Huang F and Zhou J R. 2024b. mPLUG-DocOwl 1.5: unified structure learning for OCR-free document understanding//Findings of the Association for Computational Linguistics: EMNLP 2024. Florida, USA: Association for Computational Linguistics: 3096-3120 [DOI: 10.18653/v1/2024.findings-emnlp.175]
Hu A W, Xu H Y, Zhang L, Ye J B, Yan M, Zhang J, Jin Q, Huang F and Zhou J R. 2024c. mPLUG-DocOwl2: high-resolution compressing for OCR-free multi-page document understanding [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2409.03420
Huang Z, Chen K, He J, Bai X, Karatzas D, Lu S and Jawahar C V. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction//Proceedings of the 15th International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 1516-1520 [DOI: 10.1109/ICDAR.2019.00244]
Huang Y P, Lv T C, Cui L, Lu Y T and Wei F R. 2022. LayoutLMv3: pre-training for document AI with unified text and image masking//Proceedings of the 30th ACM International Conference on Multimedia. Lisboa, Portugal: ACM: 4083-4091 [DOI: 10.1145/3503161.3548112]
Huang M X, Liu Y L, Liang D K, Jin L W and Bai X. 2025a. Mini-Monkey: alleviating the semantic sawtooth effect for lightweight MLLMs via complementary image pyramid//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Huang M Y, Lai H, Zhang X Y, Wu W J, Ma J, Zhang L L and Liu J. 2025b. EvoChart: a benchmark and a self-training approach towards real-world chart understanding//Proceedings of the 39th AAAI Conference on Artificial Intelligence. Pennsylvania, USA: AAAI
Jaume G, Ekenel H K and Thiran J P. 2019. FUNSD: a dataset for form understanding in noisy scanned documents//Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops. Sydney, Australia: IEEE: 1-6 [DOI: 10.1109/ICDARW.2019.10029]
Kardas M, Czapla P, Stenetorp P, Ruder S, Riedel S, Taylor R and Stojnic R. 2020. AxCell: automatic extraction of results from machine learning papers//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics: 8580-8594 [DOI: 10.18653/v1/2020.emnlp-main.692]
Kembhavi A, Salvato M, Kolve E, Seo M, Hajishirzi H and Farhadi A. 2016. A diagram is worth a dozen images//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 235-251 [DOI: 10.1007/978-3-319-46493-0_15]
Kim G, Hong T, Yim M, Nam J, Park J, Yim J, Hwang W, Yun S, Han D and Park S. 2022. OCR-free document understanding transformer//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 498-517 [DOI: 10.1007/978-3-031-19815-1_29]
Kim G, Lee H, Kim D, Jung H, Park S, Kim Y, Yun S, Kil T, Lee B and Park S. 2023. Visually-situated natural language understanding with contrastive reading model and frozen large language models//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics: 11989-12010 [DOI: 10.18653/v1/2023.emnlp-main.735]
Kuang J F, Hua W, Liang D K, Yang M K, Jiang D Q, Ren B and Bai X. 2023. Visual information extraction in the wild: practical dataset and end-to-end solution//Proceedings of the 17th International Conference on Document Analysis and Recognition. California, USA: Springer: 36-53 [DOI: 10.1007/978-3-031-41731-3_3]
Khattab O and Zaharia M. 2020. ColBERT: efficient and effective passage search via contextualized late interaction over BERT//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM: 39-48 [DOI: 10.1145/3397271.3401075]
Lee K, Joshi M, Turc I R, Hu H X, Liu F Y, Eisenschlos J M, Khandelwal U, Shaw P, Chang M W and Toutanova K. 2023. Pix2Struct: screenshot parsing as pretraining for visual language understanding//Proceedings of the 40th International Conference on Machine Learning. Hawaii, USA: PMLR: 18893-18912
Li P Z, Gu J X, Kuen J, Morariu V I, Zhao H D, Jain R, Manjunatha V and Liu H F. 2021. SelfDoc: self-supervised document representation learning//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 5652-5660 [DOI: 10.1109/CVPR46437.2021.00560]
Li J N, Li D X, Savarese S and Hoi S. 2023. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models//Proceedings of the 40th International Conference on Machine Learning. Hawaii, USA: PMLR: 19730-19742
Li Z, Yang B, Liu Q, Ma Z Y, Zhang S, Yang J X, Sun Y B, Liu Y L and Bai X. 2024a. Monkey: image resolution and text label are important things for large multi-modal models//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 26763-26773
Li X, Wu Y F, Jiang X H, Guo Z H, Gong M M, Cao H Y, Liu Y S, Jiang D Q and Sun X. 2024b. Enhancing visual document understanding with contrastive learning in large visual-language models//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15546-15555
Li B, Zhang K C, Zhang H, Guo D, Zhang R R, Li F, Zhang Y H, Liu Z W and Li C Y. 2024c. LLaVA-NeXT: stronger LLMs supercharge multimodal capabilities in the wild [EB/OL]. [2024-09-03]. https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms
Li Y W, Zhang Y C, Wang C Y, Zhong Z S, Chen Y X, Chu R H, Liu S T and Jia J Y. 2024d. Mini-Gemini: mining the potential of multi-modality vision language models [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2403.18814
Li B, Zhang Y H, Guo D, Zhang R R, Li F, Zhang H, Zhang K C, Zhang P Y, Li Y W, Liu Z W and Li C Y. 2024e. LLaVA-OneVision: easy visual task transfer [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2408.03326
Liao M H, Wan Z Y, Yao C, Chen K and Bai X. 2020. Real-time scene text detection with differentiable binarization//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11474-11481 [DOI: 10.1609/aaai.v34i07.6812]
Liao W H, Wang J P, Li H L, Wang C Y, Huang J and Jin L W. 2025. DocLayLLM: an efficient multi-modal extension of large language models for text-rich document understanding//Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE
Lin Z N, Wang J P and Jin L W. 2023. Visual information extraction deep learning method: a critical review. Journal of Image and Graphics, 28(8): 2276-2297 [DOI: 10.11834/jig.220904]
Lin Z Y, Liu D Y, Zhang R R, Gao P, Qiu L T, Xiao H, Qiu H, Shao W Q, Chen K Q, Han J M, Huang S Y, Zhang Y C, He X M, Qiao Y and Li H S. 2024. SPHINX: a mixer of weights, visual embeddings and image scales for multi-modal large language models//Proceedings of the 18th European Conference on Computer Vision. Milano, Italy: Springer: 36-55 [DOI: 10.1007/978-3-031-73033-7_3]
Liu Y L, Chen H, Shen C H, He T, Jin L W and Wang L W. 2020. ABCNet: real-time scene text spotting with adaptive Bezier-curve network//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9809-9818 [DOI: 10.1109/CVPR42600.2020.00983]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin transformer: hierarchical vision transformer using shifted windows//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10012-10022 [DOI: 10.1109/ICCV48922.2021.00986]
Liu C Y, Chen X X, Luo C J, Jin L W, Xue Y and Liu Y L. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics, 26(6): 1330-1367 [DOI: 10.11834/jig.210044]
Liu Z, Hu H, Lin Y T, Yao Z L, Xie Z D, Wei Y X, Ning J, Cao Y, Zhang Z, Dong L, Wei F R and Guo B N. 2022a. Swin transformer v2: scaling up capacity and resolution//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 12009-12019 [DOI: 10.1109/CVPR52688.2022.01170]
Liu Y L, Shen C H, Jin L W, He T, Chen P, Liu C Y and Chen H. 2022b. ABCNet v2: adaptive Bezier-curve network for real-time end-to-end text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11): 8048-8064 [DOI: 10.1109/TPAMI.2021.3107437]
Liu H T, Li C Y, Wu Q Y and Lee Y J. 2023. Visual instruction tuning//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 34892-34916
Liu Y L, Li H L, Bai X and Jin L W. 2023a. A brief analysis of ChatGPT: historical evolution, current applications, and future prospects. Journal of Image and Graphics, 28(4): 893-902 [DOI: 10.11834/jig.230110]
Liu C L, Jin L W, Bai X, Li X H and Yin F. 2023b. Frontiers of intelligent document analysis and recognition: review and prospects. Journal of Image and Graphics, 28(8): 2223-2252 [DOI: 10.11834/jig.221112]
Liu Y L, Li Z, Huang M X, Yang B, Yu W W, Li C Y, Yin X C, Liu C L, Jin L W and Bai X. 2024. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 67(12): 220102 [DOI: 10.1007/s11432-024-4235-6]
Liu H T, Li C Y, Li Y H, Li B, Zhang Y H, Shen S and Lee Y J. 2024a. LLaVA-NeXT: improved reasoning, OCR, and world knowledge [EB/OL]. [2024-09-03]. https://llava-vl.github.io/blog/2024-01-30-llava-next
Liu H T, Li C Y, Li Y H and Lee Y J. 2024b. Improved baselines with visual instruction tuning//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 26296-26306
Liu C L, Wei H R, Chen J Y, Kong L Y, Ge Z, Zhu Z N, Zhao L, Sun J J, Han C R and Zhang X Y. 2024c. Focus anywhere for fine-grained multi-page document understanding [EB/OL]. [2024-05-23]. https://arxiv.org/pdf/2405.14295
Liu Y L, Yang B, Liu Q, Li Z, Ma Z Y, Zhang S and Bai X. 2024d. TextMonkey: an OCR-free large multimodal model for understanding document [EB/OL]. [2024-03-15]. https://arxiv.org/pdf/2403.04473
Liu C H, Yin K, Cao H Y, Jiang X H, Li X, Liu Y S, Jiang D Q, Sun X and Xu L L. 2024e. HRVDA: high-resolution visual document assistant//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15534-15545
Liu F X, Wang X Y, Yao W L, Chen J S, Song K Q, Cho S, Yacoob Y and Yu D. 2024f. MMC: advancing multimodal chart understanding with large-scale instruction tuning//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Mexico City, Mexico: Association for Computational Linguistics: 1287-1310 [DOI: 10.18653/v1/2024.naacl-long.70]
Lu H Y, Liu W, Zhang B, Wang B X, Dong K, Liu B, Sun J X, Ren T Z, Li Z S, Yang H, Sun Y F, Deng C Q, Xu H W, Xie Z D and Ruan C. 2024a. DeepSeek-VL: towards real-world vision-language understanding [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2403.05525
Lu J H, Yu H Y, Wang Y J, Ye Y J, Tang J Q, Yang Z W, Wu B H, Liu Q, Feng H, Wang H, Liu H and Huang C. 2024b. A bounding box is worth one token: interleaving layout and text in a large language model for document understanding [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2407.01976
Luo C W, Cheng C X, Zheng Q and Yao C. 2023. GeoLayoutLM: geometric pre-training for visual information extraction//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 7092-7101 [DOI: 10.1109/CVPR52729.2023.00685]
Luo C W, Shen Y F, Zhu Z Q, Zheng Q, Yu Z and Yao C. 2024. LayoutLLM: layout instruction tuning with large language models for document understanding//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15630-15640
Luo G, Zhou Y Y, Zhang Y X, Zheng X W, Sun X S and Ji R R. 2025. Feast your eyes: mixture-of-resolution adaptation for multimodal large language models//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Luan B Z, Feng H, Chen H, Wang Y H, Zhou W G and Li H Q. 2024. TextCoT: zoom in for enhanced multimodal text-rich image understanding [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2404.09797
Lv T C, Huang Y P, Chen J Y, Cui L, Ma S M, Chang Y Y, Huang S H, Wang W H, Dong L, Luo W Y, Wu S X, Wang G X, Zhang C and Wei F R. 2023. KOSMOS-2.5: a multimodal literate model [EB/OL]. [2023-09-20]. https://arxiv.org/pdf/2309.11419
Lyu P, Li Y, Zhou H, Ma W, Wan X, Xie Q, Wu L, Zhang C, Yao K, Ding E and Wang J. 2024. StrucTexTv3: an efficient vision-language model for text-rich image perception, comprehension, and beyond [EB/OL]. [2024-06-04]. https://arxiv.org/pdf/2405.21013
Ma X G, Lin S C, Li M H, Chen W H and Lin J. 2024a. Unifying multimodal retrieval via document screenshot embedding//Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Florida, USA: ACL: 6492-6505 [DOI: 10.18653/v1/2024.emnlp-main.373]
Ma Y B, Zang Y H, Chen L Y, Chen M Q, Jiao Y Z, Li X Z, Lu X Y, Liu Z Y, Ma Y, Dong X Y, Zhang P, Pan L M, Jiang Y G, Wang J Q, Cao Y X and Sun A X. 2024b. MMLongBench-Doc: benchmarking long-context document understanding with visualizations//Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 95963-96010
Mathew M, Karatzas D and Jawahar C V. 2021. DocVQA: a dataset for VQA on document images//Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Hawaii, USA: IEEE: 2199-2208 [DOI: 10.1109/WACV48630.2021.00225]
Mathew M, Bagal V, Tito R, Karatzas D, Valveny E and Jawahar C V. 2022. InfographicVQA//Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Hawaii, USA: IEEE: 1697-1706 [DOI: 10.1109/WACV51458.2022.00264]
Masry A, Long D X, Tan J Q, Joty S and Hoque E. 2022. ChartQA: a benchmark for question answering about charts with visual and logical reasoning//Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics: 2263-2279 [DOI: 10.18653/v1/2022.findings-acl.177]
Masry A, Shahmohammadi M, Parvez M R, Hoque E and Joty S. 2024. ChartInstruct: instruction tuning for chart comprehension and reasoning//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics: 10387-10409
Meng F Q, Shao W Q, Lu Q F, Gao P, Zhang K P, Qiao Y and Luo P. 2024. ChartAssistant: a universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics: 7775-7803 [DOI: 10.18653/v1/2024.findings-acl.463]
Mishra A, Shekhar S, Singh A K and Chakraborty A. 2019. OCR-VQA: visual question answering by reading text in images//Proceedings of the 15th International Conference on Document Analysis and Recognition. Sydney, Australia: IEEE: 947-952 [DOI: 10.1109/ICDAR.2019.00156]
Microsoft. 2024. Phi-3-vision-128k-instruct [EB/OL]. [2024-09-03]. https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
Mori S, Suen C Y and Yamamoto K. 1992. Historical review of OCR research and development. Proceedings of the IEEE, 80(7): 1029-1058 [DOI: 10.1109/5.156468]
Nacson M S, Aberdam A, Ganz R, Avraham E B, Golts A, Kittenplon Y, Mazor S and Litman R. 2025. DocVLM: make your VLM an efficient reader//Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE
OpenAI. 2023a. GPT-4 technical report [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2303.08774
OpenAI. 2023b. GPT-4V(ision) system card [EB/OL]. [2024-09-03]. https://openai.com/index/gpt-4v-system-card
OpenGVLab Team. 2024a. InternVL2: better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy [EB/OL]. [2024-09-03]. https://internvl.github.io/blog/2024-07-02-InternVL-2.0/
OpenGVLab Team. 2024b. Mini-InternVL-Chat-2B-V1-5 [EB/OL]. [2024-09-03]. https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5
OpenBMB Team. 2024a. MiniCPM-V-2_6 [EB/OL]. [2024-09-03]. https://huggingface.co/openbmb/MiniCPM-V-2_6
OpenBMB Team. 2024b. MiniCPM-V 2.0: an efficient end-side MLLM with strong OCR and understanding capabilities [EB/OL]. [2024-09-03]. https://openbmb.vercel.app/minicpm-v-2-en
Ouyang L K, Qu Y, Zhou H B, Zhu J W, Zhang R, Lin Q S, Wang B, Zhao Z Y, Jiang M, Zhao X M, Shi J, Wu F, Chu P, Liu M H, Li Z X, Xu C, Zhang B, Shi B T, Tu Z Y and He C H. 2024. OmniDocBench: benchmarking diverse PDF document parsing with comprehensive annotations [EB/OL]. [2024-12-10]. https://arxiv.org/abs/2412.07626
Park S, Shin S, Lee B, Lee J, Surh J, Seo M and Lee H. 2019. CORD: a consolidated receipt dataset for post-OCR parsing//Proceedings of the Workshop on Document Intelligence at NeurIPS 2019. Vancouver, Canada: NeurIPS
Park J Y, Choi J Y, Park J and Han B. 2024. Hierarchical visual feature aggregation for OCR-free document understanding//Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 105972-105996
Pasupat P and Liang P. 2015. Compositional semantic parsing on semi-structured tables//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. Beijing, China: The Association for Computer Linguistics: 1470-1480 [DOI: 10.3115/v1/P15-1142]
Peng D Z, Yang Z H, Zhang J X, Liu C Y, Shi Y X, Ding K, Guo F J and Jin L W. 2024. UPOCR: towards unified pixel-level OCR interface//Proceedings of the 41st International Conference on Machine Learning. Vienna, Austria: PMLR: 40271-40294
Qwen Team. 2024. Introducing Qwen-VL [EB/OL]. [2024-09-03]. https://qwenlm.github.io/blog/qwen-vl
Qwen Team. 2025. Qwen2.5-VL technical report [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2502.13923
Rang M, Bi Z N, Liu C J, Wang Y H and Han K. 2024. An empirical study of scaling law for scene text recognition//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15619-15629
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. Online: PMLR: 8748-8763
Shi B G, Bai X and Yao C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11): 2298-2304 [DOI: 10.1109/TPAMI.2016.2646371]
Shan B, Fei X, Shi W, Wang A L, Tang G Z, Liao L, Tang J Q, Bai X and Huang C. 2024. MCTBench: multimodal cognition towards text-rich visual scenes benchmark [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2410.11538
Singh A, Natarajan V, Shah M, Jiang Y, Chen X L, Batra D, Parikh D and Rohrbach M. 2019. Towards VQA models that can read//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. California, USA: IEEE: 8317-8326 [DOI: 10.1109/CVPR.2019.00851]
Stanisławek T, Graliński F, Wróblewska A, Lipiński D, Kaliska A, Rosalska P, Topolski B and Biecek P. 2021. Kleister: key information extraction datasets involving long documents with complex layouts//Proceedings of the 16th International Conference on Document Analysis and Recognition. Lausanne, Switzerland: Springer: 564-579 [DOI: 10.1007/978-3-030-86549-8_36]
StepFun Team. 2024. Vision LLM documentation [EB/OL]. [2024-09-03]. https://platform.stepfun.com/docs/llm/vision
Svetlichnaya S. 2020. DeepForm: understand structured documents at scale [EB/OL]. https://wandb.ai/deepform/political-ad-extraction/benchmark
Tang Z N, Yang Z Y, Wang G X, Fang Y W, Liu Y, Zhu C G, Zeng M, Zhang C and Bansal M. 2023. Unifying vision, text, and layout for universal document processing//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 19254-19264 [DOI: 10.1109/CVPR52729.2023.01845]
Tang J Q, Lin C H, Zhao Z, Wei S, Wu B H, Liu Q, Feng H, Li Y, Wang S Q, Liao L, Shi W, Liu Y L, Liu H, Xie Y, Bai X and Huang C. 2024. TextSquare: scaling up text-centric visual instruction tuning [EB/OL]. [2024-04-19]. https://arxiv.org/pdf/2404.12803
Tanaka R, Nishida K and Yoshida S. 2021. VisualMRC: machine reading comprehension on document images//Proceedings of the 35th AAAI Conference on Artificial Intelligence. British Columbia, Canada: AAAI: 13878-13888 [DOI: 10.1609/aaai.v35i15.17635]
Tanaka R, Iki T, Nishida K, Saito K and Suzuki J. 2024. InstructDoc: a dataset for zero-shot generalization of visual document understanding with instructions//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 19071-19079 [DOI: 10.1609/aaai.v38i17.29874]
THUDM Team. 2024a. CogVLM2-LLaMA3-Chinese-Chat-19B [EB/OL]. [2024-09-03]. https://huggingface.co/THUDM/cogvlm2-llama3-chinese-chat-19B
THUDM Team. 2024b. GLM-4V-9B [EB/OL]. [2024-09-03]. https://huggingface.co/THUDM/glm-4v-9b
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wan J Q, Song S B, Yu W W, Liu Y L, Cheng W Q, Huang F, Bai X, Yao C and Yang Z B. 2024. OmniParser: a unified framework for text spotting, key information extraction and table recognition//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15641-15653
Wang J P, Jin L W and Ding K. 2022. LiLT: a simple yet effective language-independent layout transformer for structured document understanding//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: Association for Computational Linguistics: 7747-7757 [DOI: 10.18653/v1/2022.acl-long.534]
Wang Y H, Zhou W G, Feng H, Zhou K Y and Li H Q. 2023. Towards improving document understanding: an exploration on text-grounding via MLLMs [EB/OL]. [2023-12-15]. https://arxiv.org/pdf/2311.13194
Wang P, Bai S, Tan S N, Wang S J, Fan Z H, Bai J Z, Chen K Q, Liu X J, Wang J L, Ge W B, Fan Y, Dang K, Du M F, Ren X C, Men R, Liu D Y H, Zhou C, Zhou J R and Lin J Y. 2024a. Qwen2-VL: enhancing vision-language model's perception of the world at any resolution [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2409.12191
Wang D S, Raman N, Sibue M, Ma Z Q, Babkin P, Kaur S, Pei Y L, Nourbakhsh A and Liu X M. 2024b. DocLLM: a layout-aware generative language model for multimodal document understanding//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics: 8529-8548
Wang B, Xu C, Zhao X M, Ouyang L K, Wu F, Zhao Z Y, Xu R, Liu K W, Qu Y, Shang F K, Zhang B, Wei L Q, Sui Z H, Li W, Shi B T, Qiao Y, Lin D H and He C H. 2024c. MinerU: an open-source solution for precise document content extraction [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2409.18839
Wei H R, Liu C L, Chen J Y, Wang J, Kong L Y, Xu Y M, Ge Z, Zhao L, Sun J J, Peng Y, Han C R and Zhang X Y. 2024a. General OCR theory: towards OCR-2.0 via a unified end-to-end model [EB/OL]. [2025-03-07]. https://arxiv.org/pdf/2409.01704
Wei H R, Kong L Y, Chen J Y, Zhao L, Ge Z, Yang J R, Sun J J, Han C R and Zhang X Y. 2024b. Vary: scaling up the vision vocabulary for large vision-language models//Proceedings of the 18th European Conference on Computer Vision. Milano, Italy: Springer: 408-424 [DOI: 10.1007/978-3-031-73235-5_23]
Wu Z Y, Chen X K, Pan Z Z, Liu X C, Liu W, Dai D M, Gao H Z, Ma Y Y, Wu C Y, Wang B X, Xie Z D, Wu Y, Hu K, Wang J W, Sun Y F, Li Y K, Piao Y S, Guan K, Liu A X, Xie X, You Y X, Dong K, Yu X K, Zhang H W, Zhao L, Wang Y S and Ruan C. 2024. DeepSeek-VL2: mixture-of-experts vision-language models for advanced multimodal understanding [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.10302
Xia R Q, Mao S, Yan X C, Zhou H B, Zhang B, Peng H Y, Pi J H, Fu D C, Wu W J, Ye H C, Feng S Y, Wang B, Xu C, He C H, Cai P L, Dou M, Shi B T, Zhou S, Wang Y W, Wang B, Yan J C, Wu F and Qiao Y. 2024. DocGenome: an open large-scale scientific document benchmark for training and testing multi-modal large language models [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2406.11633
Xie X D, Yan H, Yin L, Liu Y, Ding J, Liao M H, Liu Y L, Chen W and Bai X. 2024. PDF-WuKong: a large multimodal model for efficient long PDF reading with end-to-end sparse sampling [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2410.05970
Xu Y H, Li M H, Cui L, Huang S H, Wei F R and Zhou M. 2020. LayoutLM: pre-training of text and layout for document image understanding//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Online: ACM: 1192-1200 [DOI: 10.1145/3394486.3403172]
Xu Y, Xu Y H, Lv T C, Cui L, Wei F R, Wang G X, Lu Y J, Florencio D, Zhang C, Che W X, Zhang M and Zhou L D. 2021. LayoutLMv2: multi-modal pre-training for visually-rich document understanding//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online: Association for Computational Linguistics: 2579-2591 [DOI: 10.18653/v1/2021.acl-long.201]
Xu Z Z, Qu B W, Qi Y Y, Du S N, Xu C J, Yuan C and Guo J. 2025. ChartMoE: mixture of diversely aligned expert connector for chart understanding//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
xAI Team. 2024. Grok 1.5V: introducing our most advanced model yet [EB/OL]. [2024-09-03]. https://x.ai/blog/grok-1.5v
Yang Z B, Tang J, Li Z H, Wang P F, Wan J Q, Zhong H M, Liu X J, Yang M K, Wang P, Bai S, Jin L W and Lin J Y. 2024. CC-OCR: a comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.02210
Yang B H, Zhang Y J, Liu D, Freitas A and Lin C H. 2025a. Does table source matter? Benchmarking and improving multimodal scientific table understanding and reasoning [EB/OL]. [2025-02-25]. https://arxiv.org/pdf/2501.13042
Yang Z H, Peng D Z, Shi Y X, Zhang Y Y, Liu C Y and Jin L W. 2025b. Predicting the original appearance of damaged historical documents//Proceedings of the 39th AAAI Conference on Artificial Intelligence. Pennsylvania, USA: AAAI
Yao Y, Yu T, Zhang A, Wang C, Cui J, Zhu H, Cai T, Li H, Zhao W, He Z, Chen Q, Zhou H, Zou Z, Zhang H, Hu S, Zheng Z, Zhou J, Cai J, Han X, Zeng G, Li D, Liu Z and Sun M. 2024. MiniCPM-V: a GPT-4V level MLLM on your phone [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2408.01800
Ye Q H, Xu H Y, Xu G H, Ye J B, Yan M, Zhou Y Y, Wang J Y, Hu A W, Shi P C, Shi Y Y, Li C L, Xu Y H, Chen H H, Tian J F, Qian Q, Zhang J, Huang F and Zhou J R. 2023a. mPLUG-Owl: modularization empowers large language models with multimodality [EB/OL]. [2024-03-29]. https://arxiv.org/pdf/2304.14178
Ye J B, Hu A W, Xu H Y, Ye Q H, Yan M, Dan Y H, Zhao C L, Xu G H, Li C L, Tian J F, Qi Q, Zhang J and Huang F. 2023b. mPLUG-DocOwl: modularized multimodal large language model for document understanding [EB/OL]. [2023-07-04]. https://arxiv.org/pdf/2307.02499
Ye J B, Hu A W, Xu H Y, Ye Q H, Yan M, Xu G H, Li C L, Tian J F, Qian Q, Zhang J, Jin Q, He L, Lin X and Huang F. 2023c. UReader: universal OCR-free visually-situated language understanding with multimodal large language model//Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics: 2841-2858 [DOI: 10.18653/v1/2023.findings-emnlp.187]
Ye Q H, Xu H Y, Ye J B, Yan M, Hu A W, Liu H W, Qian Q, Zhang J, Huang F and Zhou J R. 2023d. mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13040-13051
Yu Y, Li Y, Zhang C, Zhang X, Guo Z, Qin X, Yao K, Han J, Ding E and Wang J. 2023. StrucTexTv2: masked visual-textual prediction for document image pre-training//Proceedings of the Eleventh International Conference on Learning Representations. Kigali, Rwanda
Yu Y Q, Liao M H, Wu J H, Liao Y X, Zheng X Y and Zeng W. 2024. TextHawk: exploring efficient fine-grained perception of multimodal large language models [EB/OL]. [2024-04-14]. https://arxiv.org/pdf/2404.09204
Yu W W, Yang Z B, Wan J Q, Song S B, Tang J, Cheng W Q, Liu Y L and Bai X. 2025a. OmniParser V2: structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2502.16161
Yu S, Tang C Y, Xu B K, Cui J B, Ran J H, Yan Y K, Liu Z H, Wang S, Han X, Liu Z Y and Sun M S. 2025b. VisRAG: vision-based retrieval-augmented generation on multi-modality documents//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Zhao W C, Feng H, Liu Q, Tang J Q, Wu B H, Liao L, Wei S, Ye Y J, Liu H, Zhou W G, Li H Q and Huang C. 2024. TabPedia: towards comprehensive visual table understanding with concept synergy//Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 7185-7212
Zheng M Y, Feng X W, Si Q Y, She Q Q, Lin Z, Jiang W B and Wang W P. 2024. Multimodal table understanding//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics: 9102-9124
Zong Z F, Ma B Q, Shen D Z, Song G L, Shao H, Jiang D Z, Li H S and Liu Y. 2024. MoVA: adapting mixture of vision experts to multimodal context//Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 103305-103333
Zhang Y Z, Zhang R Y, Gu J X, Zhou Y F, Lipka N, Yang D Y and Sun T. 2024a. LLaVAR: enhanced visual instruction tuning for text-rich image understanding [EB/OL]. [2024-02-02]. https://arxiv.org/pdf/2306.17107
Zhang L, Hu A W, Xu H Y, Yan M, Xu Y C, Jin Q, Zhang J and Huang F. 2024c. TinyChart: efficient chart understanding with program-of-thoughts learning and visual token merging//Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Florida, USA: ACL: 1882-1898 [DOI: 10.18653/v1/2024.emnlp-main.112]
Zhang J X, Peng D Z, Liu C Y, Zhang P R and Jin L W. 2024d. DocRes: a generalist model toward unifying document image restoration tasks//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 15654-15664
Zhang P, Dong X Y, Zang Y H, Cao Y H, Qian R, Chen L, Guo Q P, Duan H D, Wang B, Ouyang L K, Zhang S Y, Zhang W W, Li Y N, Gao Y, Sun P, Zhang X Y, Li W, Li J W, Wang W H, Yan H, He C H, Zhang X C, Chen K, Dai J F, Qiao Y, Lin D H and Wang J Q. 2024e. InternLM-XComposer-2.5: a versatile large vision language model supporting long-contextual input and output [EB/OL]. [2024-09-03]. https://arxiv.org/abs/2407.03320
Zhang S, Yang B, Li Z, Ma Z Y, Liu Y L and Bai X. 2024f. Exploring the capabilities of large multimodal models on dense text//Proceedings of the 18th International Conference on Document Analysis and Recognition. Athens, Greece: Springer: 281-298 [DOI: 10.1007/978-3-031-70552-6_17]
Zhang J X, Yu Y Q and Zhang Y. 2024g. CREAM: coarse-to-fine retrieval and multi-modal efficient tuning for document VQA//Proceedings of the 32nd ACM International Conference on Multimedia. Melbourne, Australia: ACM: 925-934 [DOI: 10.1145/3664647.3680750]
Zhang R S, Shao R, Chen G W, Zhou K W, Guan W L and Nie L Q. 2025a. FALCON: resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers [EB/OL]. [2025-03-02]. https://arxiv.org/abs/2501.16297
Zhang J X, Yang W T, Lai S X, Xie Z C and Jin L W. 2025b. DocKylin: a large multimodal model for visual document understanding with efficient visual slimming//Proceedings of the 39th AAAI Conference on Artificial Intelligence. Pennsylvania, USA: AAAI
Zhang H T, Gao M F, Gan Z, Dufter P, Wenzel N, Huang F, Shah D, Du X Z, Zhang B W, Li Y H, Dodge S, You K, Yang Z, Timofeev A, Xu M Z, Chen H Y, Fauconnier J P, Lai Z F, You H X and Wang Z R. 2025c. MM1.5: methods, analysis and insights from multimodal LLM fine-tuning//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]
Zhou S J, Zhang R Y, Zhou Y F and Chen C Y. 2024a. A high-quality text-rich image instruction tuning dataset via hybrid instruction generation [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.16364
Zhou Y N, Chen Y X, Lin H K, Yang S Y, Zhu L, Qi Z G, Ma C and Shan Y. 2024b. DOGE: towards versatile visual document grounding and referring [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2411.17125
Zhu S D, Dong W H, Song J, Wang Y B, Guo Y A and Zheng B. 2024a. HyViLM: enhancing fine-grained recognition with a hybrid encoder for vision-language models [EB/OL]. [2025-03-01]. https://arxiv.org/abs/2412.08378
Zhu D Y, Chen J, Shen X Q, Li X and Elhoseiny M. 2024b. MiniGPT-4: enhancing vision-language understanding with advanced large language models//Proceedings of the Twelfth International Conference on Learning Representations. Vienna, Austria
Zhu Y, Zhang Y, Liu D, Xie C, Xiong Z, Zheng B and Guo S. 2025. Enhancing document understanding with group position embedding: a novel approach to incorporate layout information//Proceedings of the Thirteenth International Conference on Learning Representations. EXPO, Singapore: [s. n.]