Prompt learning and visual semantic-generation fusion for Dongba painting image captioning
2025, pp. 1-15
Received: 2024-10-05
Revised: 2025-02-12
Accepted: 2025-02-25
Published online: 2025-02-26
DOI: 10.11834/jig.240601
Objective
Dongba painting is a treasure of traditional Naxi art; its imagery is rich in visual elements, distinct in color, and strongly marked by regional culture and ethnic characteristics. To address the domain-shift problem that existing image captioning methods exhibit on Dongba paintings, this paper proposes a Dongba painting image captioning method that combines prompt learning with visual semantic-generation fusion. The method introduces a content prompt module and a visual semantic-generation fusion loss, aiming to guide the model to learn the thematic information of Dongba paintings and to improve the accuracy and cultural expressiveness of the generated descriptions.
Method
An encoder-decoder architecture is adopted to generate descriptions of Dongba painting images. The encoder uses a convolutional neural network (CNN) to capture the key semantic information in the image, and these features are integrated into the normalization layers of the decoder blocks to control the text generation process. The decoder is built on the Transformer architecture, whose self-attention mechanism effectively captures long-range dependencies in the input sequence and keeps the model focused on its key information. In addition, a content prompt module is introduced before the decoder. This module derives the subject, action, and other information of the image from its feature vector and assembles it into a prompt that is appended after the description text. Guided by this post-prompt, the decoder can attend to the specific cultural scenes and fine-grained details in the description, strengthening its ability to recognize and understand the particular patterns and scenes of Dongba paintings. Meanwhile, a visual semantic-generation fusion loss is introduced; optimizing this loss guides the model to extract the key information in Dongba paintings and thus to generate descriptions that remain highly consistent with the image.
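The paper's implementation details are not reproduced here; the following PyTorch sketch only illustrates, under assumed shapes and module names, the two ideas described above: a normalization layer whose scale and shift are modulated by the pooled CNN image feature (a hypothetical `VisualLayerNorm`), and a post-prompt built from predicted subject/action tokens and appended after the caption embeddings.

```python
# Minimal sketch (not the authors' code): feature-conditioned normalization
# and a post-prompt; all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VisualLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from the CNN image feature."""

    def __init__(self, d_model: int, d_visual: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale = nn.Linear(d_visual, d_model)
        self.to_shift = nn.Linear(d_visual, d_model)

    def forward(self, x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token states; v: (batch, d_visual) image feature
        gamma = 1.0 + self.to_scale(v).unsqueeze(1)  # modulated scale
        beta = self.to_shift(v).unsqueeze(1)         # modulated shift
        return gamma * self.norm(x) + beta


def append_post_prompt(caption_emb: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
    """Append prompt embeddings (e.g. predicted subject/action words) after the caption tokens."""
    return torch.cat([caption_emb, prompt_emb], dim=1)


if __name__ == "__main__":
    batch, seq_len, d_model, d_visual = 2, 16, 768, 1280
    tokens = torch.randn(batch, seq_len, d_model)
    image_feat = torch.randn(batch, d_visual)   # e.g. pooled CNN feature
    prompt = torch.randn(batch, 4, d_model)     # 4 post-prompt tokens

    vln = VisualLayerNorm(d_model, d_visual)
    out = append_post_prompt(vln(tokens, image_feat), prompt)
    print(out.shape)  # torch.Size([2, 20, 768])
```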
Results
Experimental results on the Dongba painting test set show that the proposed method reaches 0.603, 0.426, 0.317, 0.246, 0.256, 0.403, and 0.599 on the BLEU-1 (bilingual evaluation understudy) to BLEU-4, METEOR (metric for evaluation of translation with explicit ordering), ROUGE (recall-oriented understudy for gisting evaluation), and CIDEr (consensus-based image description evaluation) metrics, respectively, and the generated descriptions of Dongba paintings also achieve better subjective quality.
Conclusion
The proposed method markedly strengthens the model's ability to capture the themes and ethnic-cultural characteristics of Dongba painting images, and effectively improves the accuracy, semantic relevance, and fluency of the generated descriptions.
Objective
Dongba painting is a treasure of traditional Naxi art. Its visual elements are rich, its colors are distinct, and it carries strong regional cultural and ethnic characteristics. Applying existing image captioning methods to Dongba paintings is hindered by domain shift: the models generate text whose features are aligned with those of natural images. Conventional captioning approaches depend on predefined sentence structures and on the quantity and quality of the datasets, so the descriptions they generate are often repetitive and lack diversity, and they show clear limitations when applied directly to Dongba paintings. Multi-task pre-training models jointly train multiple vision-language objectives, including image-text contrastive learning, image-text matching, and image-conditioned language modeling. However, these captioning models tend to produce generic descriptions and cannot capture the distinctive style of ethnic images. Most controllable image captioning models regulate generation by analyzing textual data and extracting keywords or entities. Keyword extraction, however, depends on the explicit entity information present in a sentence; it cannot account for the implicit cultural connotations and semantic depth inherent in Dongba paintings, so the resulting descriptions are simplistic and one-dimensional. To address these issues, this paper proposes an image captioning method based on prompt learning and visual semantic-generation fusion. The approach guides the model to acquire the cultural background knowledge associated with Dongba paintings, thereby mitigating the domain-shift problem.
Method
First, an encoder-decoder architecture is used to generate captions for Dongba paintings. In the encoding stage, a convolutional neural network extracts the essential semantic information from the Dongba painting image. This information is then integrated into the normalization layers of the decoder blocks, allowing it to control the text generation process so that the resulting descriptions are semantically aligned with the image content. In the decoding stage, a Transformer structure is employed; its self-attention mechanism efficiently captures long-range dependencies in the input sequence, which facilitates the generation of accurate and coherent text. To avoid overfitting caused by stacking too many decoder layers, the model uses a 10-layer decoder, and pretrained BERT weights are used for initialization to improve performance and convergence speed. Second, a visual semantic-generation fusion loss is introduced; optimizing it guides the model to extract the key information in Dongba paintings and to generate descriptions that are highly consistent with the images. Meanwhile, a content prompt module is placed before the decoder. Through a mapper, cultural background information in the painting, such as religious patterns, gods and spirits, and hell ghosts, is obtained; by combining this prompt information with the text information, the decoder can more effectively understand and use the content of the image, which improves the relevance and accuracy of the generated descriptions with respect to the image theme. Finally, to enhance the model's capacity for targeted learning and description of painting features, a dedicated Dongba painting image captioning dataset is constructed around the themes of Dongba paintings and the underlying Dongba culture and is used for model training.
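The exact form of the visual semantic-generation fusion loss is not given on this page; the sketch below shows one plausible combination under stated assumptions (the weight `lambda_align` and the projection heads are hypothetical), pairing the standard token-level cross-entropy generation loss with a cosine-similarity alignment term between projected image and text features.

```python
# Illustrative only: one plausible form of a "visual semantic-generation
# fusion" objective, combining caption cross-entropy with an image-text
# alignment term. Weights and projection heads are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLoss(nn.Module):
    def __init__(self, d_visual: int, d_text: int, d_joint: int = 256,
                 lambda_align: float = 0.5, pad_id: int = 0):
        super().__init__()
        self.img_proj = nn.Linear(d_visual, d_joint)  # hypothetical projection heads
        self.txt_proj = nn.Linear(d_text, d_joint)
        self.lambda_align = lambda_align
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, logits, targets, image_feat, text_feat):
        # logits: (B, T, V); targets: (B, T); image_feat: (B, d_visual); text_feat: (B, d_text)
        gen_loss = self.ce(logits.flatten(0, 1), targets.flatten())
        img = F.normalize(self.img_proj(image_feat), dim=-1)
        txt = F.normalize(self.txt_proj(text_feat), dim=-1)
        align_loss = 1.0 - (img * txt).sum(dim=-1).mean()  # 1 - cosine similarity
        return gen_loss + self.lambda_align * align_loss


if __name__ == "__main__":
    B, T, V = 2, 12, 30522
    crit = FusionLoss(d_visual=1280, d_text=768)
    loss = crit(torch.randn(B, T, V), torch.randint(1, V, (B, T)),
                torch.randn(B, 1280), torch.randn(B, 768))
    print(float(loss))
```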
Results
A series of experiments is conducted on the Dongba painting test set to verify the effectiveness of the proposed method, comprising comparison experiments and ablation experiments. In the ablation experiments, the architectural configuration of the encoder and decoder is retained, and the ablation models are constructed on that basis. The comparison experiments consist of three parts. The first part compares how different encoders used for image feature extraction affect the quality of the generated description text. The second part examines how the number of decoder layers affects the quality of the generated descriptions. The third part compares the proposed method with several state-of-the-art algorithms to verify its superiority. The experimental results show that the proposed method achieves 0.603, 0.426, 0.317, 0.246, 0.256, 0.403, and 0.599 on the BLEU-1 to BLEU-4, METEOR, ROUGE, and CIDEr metrics, respectively, and also obtains better subjective quality for the generated descriptions of Dongba paintings.
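For reference, these metrics are commonly computed with the pycocoevalcap package; the snippet below is a generic usage sketch, not the paper's evaluation script, with toy captions as placeholder data. It assumes references and candidates keyed by image id, and the METEOR scorer additionally requires a Java runtime.

```python
# Generic metric computation with pycocoevalcap (pip install pycocoevalcap);
# captions here are toy examples, not the paper's data. METEOR needs Java.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image id -> list of reference captions; res: image id -> single candidate
gts = {
    "img1": ["a dongba deity rides a white yak", "a deity rides a yak"],
    "img2": ["two frogs stand beside a sacred tree"],
}
res = {
    "img1": ["a dongba deity rides a yak"],
    "img2": ["two frogs stand near a tree"],
}

bleu, _ = Bleu(4).compute_score(gts, res)   # list of BLEU-1..BLEU-4
meteor, _ = Meteor().compute_score(gts, res)
rouge, _ = Rouge().compute_score(gts, res)  # ROUGE-L
cider, _ = Cider().compute_score(gts, res)

print({"BLEU": [round(b, 3) for b in bleu],
       "METEOR": round(meteor, 3),
       "ROUGE": round(rouge, 3),
       "CIDEr": round(cider, 3)})
```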
Conclusion
The image captioning method for Dongba paintings proposed in this paper, which combines prompt learning with visual semantic-generation fusion, effectively enhances the encoder's ability to capture image features and the decoder's ability to understand the input text. The resulting descriptions are more relevant to the semantics of the images, more accurate, and more fluent, while preserving the ethnic characteristics of Dongba paintings. In future work, the relationships between the elements in Dongba paintings will be explored in depth, the size and diversity of the dataset will be expanded, and the model structure will be optimized to enhance its robustness. Other datasets will also be used for training, thereby broadening the scope of application.
References
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering // Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Banerjee S and Lavie A. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments // Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: ACL: 65-72
Devlin J, Chang M W, Lee K and Toutanova K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. [2024-08-01]. https://arxiv.org/pdf/1810.04805
Devlin J, Gupta S, Girshick R, Mitchell M and Zitnick C L. 2015. Exploring nearest neighbor approaches for image captioning [EB/OL]. [2024-08-01]. https://arxiv.org/pdf/1505.04467
Farhadi A, Hejrati M, Sadeghi M A, Young P, Rashtchian C, Hockenmaier J and Forsyth D. 2010. Every picture tells a story: generating sentences from images // Proceedings of the 11th European Conference on Computer Vision. Berlin, Germany: Springer: 15-29 [DOI: 10.1007/978-3-642-15561-1_2]
Fei J J, Wang T, Zhang J R, He Z Y, Wang C J and Zheng F. 2023. Transferable decoding with visual entities for zero-shot image captioning // Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 3136-3146 [DOI: 10.1109/ICCV51070.2023.00291]
Johnson J, Karpathy A and Li F F. 2016. DenseCap: fully convolutional localization networks for dense captioning // Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4565-4574 [DOI: 10.1109/CVPR.2016.494]
Li J N, Li D X, Xiong C M and Hoi S. 2022. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation // Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR: 12888-12900
Li S, Kulkarni G, Berg T, Berg A and Choi Y. 2011. Composing simple image descriptions using web-scale n-grams // Proceedings of the 15th Conference on Computational Natural Language Learning. Portland, USA: ACL: 220-228
Li W, Zhu L, Wen L C and Yang Y. 2023. DeCap: decoding CLIP latents for zero-shot captioning via text-only training [EB/OL]. [2024-08-01]. https://arxiv.org/abs/2303.03032
Lin C Y. 2004. ROUGE: a package for automatic evaluation of summaries // Text Summarization Branches Out. ACL: 74-81
Loshchilov I and Hutter F. 2017. Decoupled weight decay regularization [EB/OL]. [2024-08-01]. https://arxiv.org/pdf/1711.05101
Mokady R, Hertz A and Bermano A H. 2021. ClipCap: CLIP prefix for image captioning [EB/OL]. [2024-08-01]. https://arxiv.org/pdf/2111.09734
Papineni K, Roukos S, Ward T and Zhu W J. 2002. BLEU: a method for automatic evaluation of machine translation // Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, USA: ACL: 311-318 [DOI: 10.3115/1073083.1073135]
Qian W H, Xu D, Xu J, He L and Zhang B. 2020. Simulation of Dongba art style painting. Journal of System Simulation, 32: 1349-1359 [DOI: 10.16182/j.issn1004731x.joss.19-VR0530]
Qiu L T, Ning S and He X M. 2024. Mining fine-grained image-text alignment for zero-shot captioning via text-only training // Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 4605-4613 [DOI: 10.1609/aaai.v38i5.28260]
Sandler M, Howard A, Zhu M L, Zhmoginov A and Chen L C. 2018. MobileNetV2: inverted residuals and linear bottlenecks // Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4510-4520 [DOI: 10.1109/CVPR.2018.00474]
Su Q. 2022. The symbiosis of aesthetic imagery: an examination of the origins of modeling style in Dongba paintings. Gansu Social Sciences, 5: 98-104 [DOI: 10.15891/j.cnki.cn62-1093/c.2022.05.018]
Tang P J, Tan Y L and Li J Z. 2017. Image description based on the fusion of scene and object category prior knowledge. Journal of Image and Graphics, 22(9): 1251-1260 [DOI: 10.11834/jig.170052]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need // Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates, Inc.: 6000-6010
Vedantam R, Lawrence Zitnick C and Parikh D. 2015. CIDEr: consensus-based image description evaluation // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 4566-4575 [DOI: 10.1109/CVPR.2015.7299087]
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator // Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3156-3164 [DOI: 10.1109/CVPR.2015.7298935]
Waghmare P M and Shinde S V. 2022. Image caption generation using neural network models and LSTM hierarchical structure // Das A K, Nayak J, Naik B, Dutta S and Pelusi D, eds. Computational Intelligence in Pattern Recognition. Singapore: Springer Singapore: 109-117 [DOI: 10.1007/978-981-16-2543-5_10]
Wang N, Xie J H, Wu J H, Jia M B and Li L L. 2023. Controllable image captioning via prompting // Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington DC, USA: AAAI: 2617-2625 [DOI: 10.1609/aaai.v37i2.25360]
Wang P, Yang A, Men R, Lin J Y, Bai S, Li Z K, Ma J X, Zhou C, Zhou J R and Yang H X. 2022. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework // Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR: 23318-23340
Yang H R. 2021. Study on the artistic features of Naxi Dongba pictographic scripts. Chinese Minzu Art, 3: 60-63
Yu J R, Li H R, Hao Y B, Zhu B, Xu T and He X N. 2023. CgT-GAN: CLIP-guided text GAN for image captioning // Proceedings of the 31st ACM International Conference on Multimedia. New York, USA: ACM: 2252-2263 [DOI: 10.1145/3581783.3611891]
Zeng Z Q, Xie Y, Zhang H, Chen C Y, Chen B and Wang Z J. 2024. MeaCap: memory-augmented zero-shot image captioning // Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 14100-14110 [DOI: 10.1109/CVPR52733.2024.01337]
Zhao Y Q, Jin Z, Zhang F, Zhao H Y, Tao Z W, Dou C F, Xu X H and Liu D H. 2023. Deep-learning-based image captioning: analysis and prospects. Journal of Image and Graphics, 28(9): 2788-2816 [DOI: 10.11834/jig.220660]
Zheng Y, Li Y L and Wang S J. 2019. Intention oriented image captions with guiding objects // Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 8395-8404 [DOI: 10.1109/CVPR.2019.00859]
Zhou L W, Palangi H, Zhang L, Hu H D, Corso J and Gao J F. 2020. Unified vision-language pre-training for image captioning and VQA // Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 13041-13049 [DOI: 10.1609/aaai.v34i07.7005]