A Multimodal Large Model-based Method for Generating Visual Q&A Data for Electronic Document Images
2025, pp. 1-14
Received: 2024-10-08
Revised: 2025-02-16
Accepted: 2025-02-25
Published online: 2025-02-26
DOI: 10.11834/jig.240610
Objective
Visual question answering (VQA) data generation for electronic documents aims to combine the textual content of electronic document images with their visual information to produce questions and the corresponding answers. High-quality visual instruction-tuning datasets can significantly improve the document reading performance of multimodal large language models. However, datasets built by manual annotation or template-based methods currently suffer from insufficient quantity and limited quality. This paper therefore designs a question-answering data generation method for electronic document images based on a multimodal large language model.
Method
A large-scale data generation pipeline based on a multimodal large language model is proposed, consisting of four key steps: self-questioning and answering, quantity and format checking, data filtering, and consistency checking. In the first stage, electronic document images and the corresponding instructions are fed into the multimodal large language model to generate an initial set of question-answer pairs. In the second step, the number and format of the generated pairs are checked. In the third step, the qualified question-answer pairs, together with their corresponding images and instructions, are passed back to the multimodal large language model to filter out pairs that are irrelevant to the image content, answered incorrectly, or do not make use of external knowledge. In the final step, the multimodal large language model generates several differently phrased versions of the question for each question-answer pair and the consistency of the answers is checked, so that pairs with inconsistent answers are removed.
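To make the four steps concrete, the sketch below wires the first three of them together in Python. The helper `query_mllm(image, prompt)`, the prompt wording, and the requested number of pairs `N_PAIRS` are hypothetical placeholders rather than details taken from the paper; the final consistency-checking step is shown in a separate sketch later in the article.

```python
# Illustrative sketch of the generation pipeline (stages 1-3); `query_mllm` is a
# hypothetical helper that sends an image plus a text instruction to a multimodal
# large language model and returns its raw text response.
import json

N_PAIRS = 5  # assumed number of Q&A pairs requested per document image

def generate_qa_for_image(image, query_mllm):
    # Stage 1: self-questioning and answering.
    raw = query_mllm(image,
        f"Read this document image and produce {N_PAIRS} question-answer pairs "
        'as a JSON list of {"question": ..., "answer": ...} objects.')

    # Stage 2: quantity and format check; malformed or short outputs are discarded.
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if (not isinstance(pairs, list) or len(pairs) != N_PAIRS or
            not all(isinstance(p, dict) and {"question", "answer"} <= set(p)
                    for p in pairs)):
        return []

    # Stage 3: data filtering, asking the model to judge each pair against the image.
    kept = []
    for p in pairs:
        verdict = query_mllm(image,
            "Given this document image, does the following Q&A pair satisfy the "
            "relevance, correctness, and external-knowledge criteria? Answer yes or "
            f"no.\nQ: {p['question']}\nA: {p['answer']}")
        if verdict.strip().lower().startswith("yes"):
            kept.append(p)

    # Stage 4 (consistency checking) is applied to `kept` afterwards; see the
    # consistency-validation sketch later in the article.
    return kept
```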
Results
A high-quality dataset containing 324,546 images and 2,036,263 question-answer pairs is constructed. Random sampling of the question-answer pairs shows a correctness rate of 91.34%. The dataset's effect on the performance of multimodal large language models is further evaluated on document question answering benchmarks such as DocVQA. Fine-tuning experiments show that, for the LLaVA-OV and Deepseek-VL models, fine-tuning on this dataset raises the average normalized Levenshtein similarity on DocVQA by 1.4% and 2.6%, respectively. Ablation experiments further show that removing the data filtering step degrades model performance by 1.3%. A complementarity experiment with the manually annotated DocVQA data shows that adding part of the generated visual question answering data to the DocVQA training set improves model performance by 1.3% over fine-tuning on the DocVQA training set alone. In addition, compared with datasets generated by existing methods, the dataset generated by the proposed method yields the largest performance gain. A subsequent post-processing experiment further shows that there is still room for improvement in how the question-answer pairs of the proposed dataset are generated.
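The gains reported above are measured with average normalized Levenshtein similarity (ANLS), the standard DocVQA metric. A minimal sketch of the metric is given below, assuming its commonly used form with a similarity threshold of 0.5; the authors' exact evaluation script may differ in details such as answer normalization.

```python
# ANLS: per question, score 1 - NL(prediction, reference) when the similarity
# clears the threshold, otherwise 0; with several reference answers the best
# match counts, and scores are averaged over all questions.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nl)
        total += best if best >= threshold else 0.0
    return total / max(len(predictions), 1)

# Example: anls(["2019"], [["2019", "the year 2019"]]) evaluates to 1.0.
```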
Conclusion
The proposed multimodal large language model-based method for generating visual question answering data for electronic document images effectively addresses the limited quantity and poor quality of existing datasets and significantly improves the document reading comprehension ability of multimodal large language models.
Objective
Recent advancements in multimodal large language models (MLLMs) have significantly revolutionized the field of visual question answering (VQA), especially in the domain of text-centric document understanding. These models have demonstrated remarkable improvements in tasks that involve the integration of visual and textual information, with several state-of-the-art models currently leading the field. One critical task in this area is the generation of VQA datasets for electronic documents, which entails combining the textual content embedded within document images with their visual components to generate meaningful and contextually relevant questions and corresponding answers. High-quality instruction-tuning datasets designed specifically for multimodal instruction-following tasks have been shown to greatly enhance the document comprehension capabilities of MLLMs. However, existing VQA datasets, which are typically generated using manual annotation techniques or templating methods, face significant challenges in terms of both scale and quality. These limitations impede the scalability and overall effectiveness of training datasets for multimodal models. Consequently, this paper proposes a method to automatically generate image-based VQA datasets for electronic documents by utilizing a multimodal large language model. The goal of this work is to address the existing gaps in dataset quality and quantity, thereby facilitating better training and fine-tuning of MLLMs in the context of document-based visual question answering tasks.
Method
The proposed methodology involves the use of a large-scale data generation framework, powered by a multimodal large language model. This framework is divided into four distinct stages: self-question generation, quantity and format verification, data filtering, and consistency validation. In the initial stage, the multimodal large language model is tasked with generating multiple question-answer (Q&A) pairs by processing the input electronic document images alongside their corresponding textual descriptions. This stage capitalizes on the model’s ability to simultaneously analyze both the visual and textual elements of the document, enabling it to generate a diverse array of questions that cover various aspects of the content, such as factual inquiries, inferential reasoning, and contextual understanding. The second stage focuses on ensuring that the generated question-answer pairs meet both the required quantity and adhere to the correct formatting standards. This stage is critical for eliminating any inconsistencies, errors, or discrepancies in the formatting of the data, which could otherwise compromise the quality of the final dataset. In the third stage, data filtering is employed to refine the dataset by eliminating irrelevant or incorrect Q&A pairs. This process involves evaluating the generated question-answer pairs, along with their corresponding images and instructions, to identify and discard any irrelevant or improperly answered pairs. The purpose of this step is to ensure that the dataset contains only high-quality questions that require multimodal reasoning capabilities for accurate responses. The final stage involves consistency validation, wherein the multimodal large language model is used to generate multiple variations of the same question-answer pair. The objective of this stage is to verify that the answers remain consistent across different rephrasings of the same question. If inconsistencies in the answers are identified, those pairs are discarded. This step not only ensures the reliability and accuracy of the dataset but also helps improve the robustness of the dataset by introducing diverse question formulations. By systematically applying these four stages, the proposed method enables the generation of a large-scale, high-quality VQA dataset for electronic documents, which can then be leveraged to fine-tune multimodal large language models and enhance their performance in document understanding tasks.
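The consistency-validation stage described above can be sketched as follows, again assuming a hypothetical `query_mllm(image, prompt)` helper around the multimodal model; the number of rephrasings and the exact-match agreement rule are illustrative choices rather than values reported in the paper.

```python
# Sketch of consistency validation: rephrase the question several times, answer
# each rephrasing against the same image, and keep the pair only if every answer
# agrees with the original one after light normalization.
def normalize(answer: str) -> str:
    return " ".join(answer.lower().split())

def answers_consistent(image, pair, query_mllm, n_rephrasings=3):
    raw = query_mllm(image,
        f"Rewrite the question below in {n_rephrasings} different ways without "
        f"changing its meaning, one per line.\nQuestion: {pair['question']}")
    rephrasings = [q.strip() for q in raw.splitlines() if q.strip()][:n_rephrasings]

    reference = normalize(pair["answer"])
    for question in rephrasings:
        answer = query_mllm(image, f"Answer briefly based on the image: {question}")
        if normalize(answer) != reference:
            return False  # inconsistent answers: the pair is discarded
    return True
```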
Results
In this study, a high-quality dataset was constructed, consisting of 324,546 images and 2,036,263 corresponding Q&A pairs. The overall correctness rate of 91.34% was achieved through random sampling of a sufficiently large number of images and their associated Q&A pairs, followed by manual verification of the selected samples. The impact of this dataset on improving the performance of multimodal large language models in document-based question answering tasks, such as DocVQA, was rigorously evaluated. Fine-tuning experiments on the LLaVA-OV and Deepseek-VL models demonstrated improvements of 1.4% and 2.6%, respectively, in average normalized Levenshtein similarity on DocVQA. Additionally, ablation studies were conducted to assess the effectiveness of the data filtering process. These studies revealed that correctness filtering, relevance filtering, and external knowledge filtering each contributed to the enhancement of the performance of multimodal large language models when applied to the generated dataset. Interestingly, it was found that relevance filtering and external knowledge filtering did not conflict with one another. On the contrary, applying both filtering methods simultaneously resulted in better model performance than when either one was applied alone. Furthermore, the entire data filtering process resulted in a 1.3% performance improvement for the Deepseek-VL model on the DocVQA dataset. Complementarity experiments with the DocVQA dataset showed a 1.3% performance gain when the model was fine-tuned on both the DocVQA dataset and a subset of the visual Q&A dataset. This demonstrated the ability of the generated dataset to complement manually labeled data and showcased the effectiveness of the synthetic data generation process. To further validate the superiority of the proposed dataset generation method, comparisons were made between the model performance achieved using the ALLAVA and TG-Doc datasets and the model performance obtained using the generated data from the proposed method. Specifically, 1 million instruction samples were randomly selected from each of these datasets for full fine-tuning of the LLaVA-OV model. Experimental results indicated that the generated data from the proposed method led to the most significant improvement in model performance. Finally, it was observed that while the proposed dataset resulted in some improvement in model performance, the overall gain was somewhat limited. A more in-depth analysis revealed that redundant characters in the generated answers—such as unnecessary phrasing—contributed to a degradation in performance. To address this issue, a post-processing experiment was conducted using Qwen2.5-14B. By removing redundant content from the model's outputs, the post-processing technique significantly enhanced performance, indicating that further refinement in the dataset generation process could yield even better results.
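The post-processing step mentioned above, in which Qwen2.5-14B removes redundant content from generated answers, might look roughly like the sketch below; the `chat(prompt)` callable and the prompt wording are assumptions standing in for the authors' actual setup rather than a documented API.

```python
# Sketch of answer post-processing: ask a text-only LLM to strip redundant
# phrasing so that answers match the short ground-truth style of DocVQA.
def strip_redundancy(question: str, answer: str, chat) -> str:
    prompt = (
        "The answer below may contain redundant phrasing, such as restating the "
        "question. Return only the minimal span that directly answers the question, "
        "with no extra words.\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )
    cleaned = chat(prompt).strip()
    # Fall back to the original answer if the model returns nothing or something
    # longer than the text it was asked to shorten.
    return cleaned if cleaned and len(cleaned) <= len(answer) else answer

# Example: strip_redundancy("What year is shown?",
#                           "The year shown in the document is 2019.", chat)
# would ideally return "2019".
```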
Conclusion
The proposed method for multimodal large language model-driven generation of image-based VQA data for electronic documents effectively addresses the challenges of limited dataset size and poor data quality. The comprehensive evaluation of the dataset’s impact on model performance, along with the successful implementation of data filtering and post-processing strategies, demonstrates the potential of this approach to improve the robustness and accuracy of multimodal models in document-based visual question answering tasks. Future work could further refine this process to eliminate redundant content and optimize the generated dataset for even better performance.
References
Yan H, Liu Y L, Jin L W and Bai X. 2023. The development, application, and future of LLM similar to ChatGPT. Journal of Image and Graphics, 28(09): 2749-2762 [DOI: 10.11834/jig.230536]
Liu C L, Jin L W, Bai X, Li X H and Yin F. 2023. Frontiers of intelligent document analysis and recognition: review and prospects. Journal of Image and Graphics, 28(08): 2223-2252 [DOI: 10.11834/jig.221112]
Liu Y L, Li H L, Bai X and Jin L W. 2023. A brief analysis of ChatGPT: historical evolution, current applications, and future prospects. Journal of Image and Graphics, 28(04): 893-902 [DOI: 10.11834/jig.230110]
Lan H and Zhang P F. 2022. Question-guided spatial relation graph reasoning model for visual question answering. Journal of Image and Graphics, 27(7): 2274-2286 [DOI: 10.11834/jig.200611]
Wang F, Shi F Y, Zhao J, Zhang X S and Wang X F. 2023. Answer mask-fused visual question answering model. Journal of Image and Graphics, 28(11): 3562-3574 [DOI: 10.11834/jig.211137]
Mishra A, Shekhar S, Singh A K, et al. OCR-VQA: Visual question answering by reading text in images [C]// 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 947-952. [DOI: 10.1109/ICDAR.2019.00156]
Mathew M, Karatzas D, Jawahar C V. DocVQA: A dataset for VQA on document images [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 2200-2209. [DOI: 10.1109/WACV48630.2021.00225]
Ye J, Hu A, Xu H, et al. mPLUG-DocOwl: Modularized multimodal large language model for document understanding [J]. arXiv preprint arXiv:2307.02499, 2023. [DOI: 10.48550/ARXIV.2307.02499]
Feng H, Wang Z, Tang J, et al. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding [J]. arXiv preprint arXiv:2308.11592, 2023. [DOI: 10.48550/ARXIV.2308.11592]
Liu Y, Yang B, Liu Q, et al. TextMonkey: An OCR-free large multimodal model for understanding document [J]. arXiv preprint arXiv:2403.04473, 2024. [DOI: 10.48550/ARXIV.2403.04473]
Feng H, Liu Q, Liu H, et al. DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding [J]. arXiv preprint arXiv:2311.11810, 2023. [DOI: 10.48550/ARXIV.2311.11810]
Liu H, Li C, Wu Q, et al. Visual instruction tuning [C]// Advances in Neural Information Processing Systems, 2023, 36: 34892-34916. [DOI: 10.48550/arXiv.2304.08485]
Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models [C]// International Conference on Machine Learning. PMLR, 2023: 19730-19742. [DOI: 10.48550/arXiv.2301.12597]
Alayrac J B, Donahue J, Luc P, et al. Flamingo: a visual language model for few-shot learning [J]. Advances in Neural Information Processing Systems, 2022, 35: 23716-23736.
Liu C, Yin K, Cao H, et al. HRVDA: High-resolution visual document assistant [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15534-15545. [DOI: 10.48550/ARXIV.2404.06918]
Peng B, Li C, He P, et al. Instruction tuning with GPT-4 [J]. arXiv preprint arXiv:2304.03277, 2023. [DOI: 10.48550/ARXIV.2304.03277]
Shao Z, Gong Y, Shen Y, et al. Synthetic prompting: Generating chain-of-thought demonstrations for large language models [C]// International Conference on Machine Learning. PMLR, 2023: 30706-30775.
Tang J, Lin C, Zhao Z, et al. TextSquare: Scaling up text-centric visual instruction tuning [J]. arXiv preprint arXiv:2404.12803, 2024. [DOI: 10.48550/ARXIV.2404.12803]
Zhang Y, Zhang R, Gu J, et al. LLaVAR: Enhanced visual instruction tuning for text-rich image understanding [J]. arXiv preprint arXiv:2306.17107, 2023. [DOI: 10.48550/ARXIV.2306.17107]
Wang Y, Zhou W, Feng H, et al. Towards improving document understanding: An exploration on text-grounding via MLLMs [J]. arXiv preprint arXiv:2311.13194, 2023. [DOI: 10.48550/ARXIV.2311.13194]
Chen L, Li J, Dong X, et al. ShareGPT4V: Improving large multi-modal models with better captions [J]. arXiv preprint arXiv:2311.12793, 2023. [DOI: 10.48550/ARXIV.2311.12793]
Chen G H, Chen S, Zhang R, et al. ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model [J]. arXiv preprint arXiv:2402.11684, 2024. [DOI: 10.48550/ARXIV.2402.11684]
Mathew M, Bagal V, Tito R, et al. InfographicVQA [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022: 1697-1706. [DOI: 10.1109/WACV51458.2022.00264]
Tanaka R, Nishida K, Yoshida S. VisualMRC: Machine reading comprehension on document images [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(15): 13878-13888. [DOI: 10.1609/AAAI.V35I15.17635]
Chen X, Zhao Z, Chen L, et al. WebSRC: A dataset for web-based structural reading comprehension [J]. arXiv preprint arXiv:2101.09465, 2021. [DOI: 10.18653/V1/2021.EMNLP-MAIN.343]
Liu Y, Li Z, Yang B, et al. On the hidden mystery of OCR in large multimodal models [J]. arXiv preprint arXiv:2305.07895, 2023.
Li Z, Yang B, Liu Q, et al. Monkey: Image resolution and text label are important things for large multi-modal models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 26763-26773. [DOI: 10.1109/CVPR52733.2024.02527]
Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// Computer Vision - ECCV 2014: 13th European Conference. 2014: 740-755. [DOI: 10.1007/978-3-319-10602-1_48]
Liu H, Li C, Li Y, et al. Improved baselines with visual instruction tuning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 26296-26306. [DOI: 10.1109/CVPR52733.2024.02484]
Liu Y, Jin L, Zhang S, et al. Curved scene text detection via transverse and longitudinal sequence connection [J]. Pattern Recognition, 2019, 90: 337-345. [DOI: 10.1016/j.patcog.2019.02.002]
Long S, Qin S, Panteleev D, et al. Towards end-to-end unified scene text detection and layout analysis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 1049-1059. [DOI: 10.1109/CVPR52688.2022.00112]
Li B, Zhang Y, Guo D, et al. LLaVA-OneVision: Easy visual task transfer [J]. arXiv preprint arXiv:2408.03326, 2024. [DOI: 10.48550/ARXIV.2408.03326]
Lu H, Liu W, Zhang B, et al. DeepSeek-VL: Towards real-world vision-language understanding [J]. arXiv preprint arXiv:2403.05525, 2024. [DOI: 10.48550/ARXIV.2403.05525]
Laurençon H, Marafioti A, Sanh V, et al. Building and better understanding vision-language models: insights and future directions [J]. arXiv preprint arXiv:2408.12637, 2024.
Masry A, Long D X, Tan J Q, et al. ChartQA: A benchmark for question answering about charts with visual and logical reasoning [J]. arXiv preprint arXiv:2203.10244, 2022. [DOI: 10.18653/V1/2022.FINDINGS-ACL.177]
Zhong X, Tang J, Yepes A J. PubLayNet: Largest dataset ever for document layout analysis [C]// 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022. [DOI: 10.1109/ICDAR.2019.00166]
Chen Z, Wu J, Wang W, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 24185-24198. [DOI: 10.48550/ARXIV.2312.14238]
Brown T B, et al. Language models are few-shot learners [J]. arXiv preprint arXiv:2005.14165, 2020.
Hui B, Yang J, Cui Z, et al. Qwen2.5-Coder technical report [J]. arXiv preprint arXiv:2409.12186, 2024.