Answer mask-fused visual question answering model
2023, Vol. 28, No. 11, Pages: 3562-3574
Print publication date: 2023-11-16
DOI: 10.11834/jig.211137
Wang Feng, Shi Fangyu, Zhao Jia, Zhang Xuesong, Wang Xuefeng. 2023. Answer mask-fused visual question answering model. Journal of Image and Graphics, 28(11):3562-3574
Objective
Existing visual question answering models suffer from low prediction accuracy because they are affected by language priors. Although a model can learn simple correspondences between questions and answers from their statistical regularities in the dataset, it cannot learn the deeper correspondence between questions and answer types, and therefore often produces answers that are irrelevant to the question. To address this, we propose a method that uses answer masks to cover the irrelevant answers in the prediction results, forcing the model to attend to the correspondence between questions and answer types and improving its prediction accuracy.
Method
First, the answers in the dataset are clustered and a different answer mask is generated for each answer class. A pre-trained answer type recognition model then predicts the answer type corresponding to each question, and according to its prediction the matching answer mask is selected to cover the prediction results of the baseline model, yielding the final answer.
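A minimal sketch of the masking step described above, written in Python with hypothetical names (`apply_answer_mask`, `baseline_scores`, `type_logits`, `answer_masks`) that are not taken from the authors' released code: the answer type recognition model selects one pre-computed mask per question, and the baseline model's answer scores are multiplied elementwise by that mask so that irrelevant answers are zeroed out.

```python
import torch

def apply_answer_mask(baseline_scores, type_logits, answer_masks):
    """Zero out answers that do not belong to the predicted answer type.

    baseline_scores: (batch, num_answers) answer scores from the baseline VQA model
    type_logits:     (batch, num_types) output of the answer type recognition model
    answer_masks:    (num_types, num_answers) 0/1 mask per answer cluster
    """
    predicted_type = type_logits.argmax(dim=-1)   # (batch,) index of predicted answer type
    mask = answer_masks[predicted_type]           # (batch, num_answers) selected masks
    return baseline_scores * mask                 # irrelevant answers are set to 0

# Usage (shapes only): masked = apply_answer_mask(scores, type_logits, masks)
# final_answer = masked.argmax(dim=-1)
```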
Result
The proposed method is evaluated with four baseline models, UpDn (bottom-up and top-down), RUBi (reducing unimodal biases), LMH (learned-mixin +h), and CSS (counterfactual samples synthesizing), on three large public datasets. On the VQA (visual question answering)-CP v2.0 dataset, the method improves the accuracy of UpDn by 2.15% and of LMH by 2.29%; the CSS model combined with our method reaches an accuracy of 60.14%, 2.02% higher than the original model and among the best results reported to date. Results on VQA v2.0 and VQA-CP v1.0 also show that the method improves the accuracy of most models, demonstrating good generalization. In addition, ablation experiments on VQA-CP v2.0 verify the effectiveness of the method.
Conclusion
The proposed method covers the prediction results of a visual question answering model with answer masks, reducing the influence of irrelevant answers on the final result and enabling the model to learn the correspondence between questions and answer types. It effectively alleviates the problem of answers being irrelevant to the question and improves the model's prediction accuracy.
Objective
Visual question answering (VQA) has become an essential task in artificial intelligence (AI) in recent years. It lies at the intersection of natural language processing and computer vision: a VQA model must process text and image information simultaneously and fuse the two modalities to infer the answer. Popular VQA models based on deep neural networks are trained on datasets such as VQA v2.0. However, because of language priors in these datasets, such models tend to simplify the task into learning superficial correlations between questions and answers. Owing to the uneven distribution of answers, they generalize poorly and perform badly on the VQA-CP v2.0 dataset. Specifically, language priors lead to prediction errors in which the predicted answer is irrelevant to the question. To reduce this irrelevance and improve the generalization of the model, we develop an answer-mask method that covers the irrelevant answers in the prediction results, forcing the model to learn the deeper relationship between the question and the answer type and thereby improving prediction accuracy.
Method
The prediction results of the baseline model are masked with an answer mask. All candidate answers must first be clustered; each answer type should contain as few answers as possible, so that the mask covers more irrelevant answers while still allowing accurate classification. The answers consist of context-free words and phrases, for which conventional Word2Vec and GloVe encodings are not effective, so CLIP is used as the encoder to extract answer features. The k-means algorithm then clusters the extracted answer feature vectors. After clustering, the original dataset is modified so that the type of every answer is its post-clustering type, and a different answer mask vector is generated for each answer type. An answer mask vector consists of 0s and 1s: the elements at the positions of the answers contained in the corresponding type are set to 1 and all other elements are set to 0, which eliminates the influence of irrelevant answers on the final results of the baseline model. We also design an answer type recognition model that is pre-trained on questions and their answer types; given a question, it predicts the corresponding answer type. Its accuracy reflects the quality of the clustering, and its predictions determine which answer mask is selected. The baseline model encodes the image and the text, fuses the image and text features with a deep neural network, and obtains preliminary prediction results through a classifier. First, the corresponding answer mask vector is selected according to the prediction of the answer type recognition model. Then, the mask is multiplied elementwise with the prediction results of the baseline model, covering the distribution of irrelevant answers in those results. Finally, the final prediction is obtained. Through this process, the model is trained to learn the correspondence between questions and answer types.
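The clustering and mask-construction steps might be sketched as follows in Python, assuming the OpenAI `clip` package and scikit-learn's k-means are used as described above; the toy answer vocabulary, the cluster count, and all variable names are illustrative assumptions rather than the paper's actual settings.

```python
import numpy as np
import torch
import clip
from sklearn.cluster import KMeans

answers = ["yes", "no", "red", "blue", "two", "three"]   # toy candidate answer vocabulary

# Encode each candidate answer with CLIP's text encoder.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    tokens = clip.tokenize(answers).to(device)
    feats = model.encode_text(tokens).float().cpu().numpy()

# Cluster the answer features; each cluster becomes one answer type.
num_types = 3                                             # illustrative, not the paper's value
labels = KMeans(n_clusters=num_types, n_init=10).fit_predict(feats)

# Build one 0/1 mask per answer type: 1 for answers inside the cluster, 0 elsewhere.
answer_masks = np.zeros((num_types, len(answers)), dtype=np.float32)
answer_masks[labels, np.arange(len(answers))] = 1.0
```

The resulting `answer_masks` matrix is the kind of table that the masking step shown earlier would index with the predicted answer type.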
Result
We select UpDn, RUBi, LMH, and CSS as baseline models and carry out experiments on three large public datasets: VQA-CP v2.0, VQA v2.0, and VQA-CP v1.0. The experiments on VQA-CP v2.0 show the potential of the method to improve model accuracy: the accuracies of the UpDn, LMH, and CSS models are improved by 2.15%, 2.29%, and 2.02%, respectively, and the CSS model reaches the highest accuracy of 60.14%. In addition, accuracy on VQA v2.0 is preserved rather than reduced: the experimental results on VQA v2.0 show that the accuracy of most baseline models is further improved, with the CSS model improved by 3.18%. To further demonstrate the generalization of the method, comparative experiments are also conducted on the VQA-CP v1.0 dataset; the results show that our method benefits most of the baseline models, which is sufficient to reflect its generalization ability. Furthermore, an ablation experiment on VQA-CP v2.0 confirms that the improvement in accuracy comes from the answer mask.
Conclusion
We develop an answer-mask method that covers the irrelevant answers in the model's prediction results, alleviating their influence on the final result. The model is thereby forced to learn the correspondence between the question and the answer type, which resolves, to a certain extent, the problem of the predicted answer being irrelevant to the question and improves the model's generalization and accuracy.
visual question answering (VQA); language priors; answer clustering; answer mask; answer type recognition
Agrawal A, Batra D and Parikh D. 2016. Analyzing the behavior of visual question answering models//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA: Association for Computational Linguistics: 1-13 [DOI: 10.18653/v1/D16-1203]
Agrawal A, Batra D, Parikh D and Kembhavi A. 2018. Don't just assume; look and answer: overcoming priors for visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4971-4980 [DOI: 10.1109/CVPR.2018.00522]
Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S and Zhang L. 2018. Bottom-up and top-down attention for image captioning and visual question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6077-6086 [DOI: 10.1109/CVPR.2018.00636]
Antol S, Agrawal A, Lu J S, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. VQA: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279]
Ben-Younes H, Cadene R, Cord M and Thome N. 2017. MUTAN: multimodal tucker fusion for visual question answering//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2631-2639 [DOI: 10.1109/ICCV.2017.285]
Cadène R, Dancette C, Ben-younes H, Cord M and Parikh D. 2019. RUBi: reducing unimodal biases for visual question answering//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: [s.n.]
Chen L, Yan X, Xiao J, Zhang H W, Pu S L and Zhuang Y T. 2020. Counterfactual samples synthesizing for robust visual question answering//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10797-10806 [DOI: 10.1109/CVPR42600.2020.01081]
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Clark C, Yatskar M and Zettlemoyer L. 2019. Don't take the easy way out: ensemble based methods for avoiding known dataset biases//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 1-14 [DOI: 10.18653/v1/D19-1418]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, USA: Association for Computational Linguistics: 1-16 [DOI: 10.18653/v1/N19-1423]
Fukui A, Park D H, Yang D, Rohrbach A, Darrell T and Rohrbach M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding//Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA: Association for Computational Linguistics: 457-468 [DOI: 10.18653/v1/D16-1044]
Gokhale T, Banerjee P, Baral C and Yang Y Z. 2020. MUTANT: a training paradigm for out-of-distribution generalization in visual question answering//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. [s.l.]: Association for Computational Linguistics: #63 [DOI: 10.18653/v1/2020.emnlp-main.63]
Goyal Y, Khot T, Summers-Stay D, Batra D and Parikh D. 2017. Making the V in VQA matter: elevating the role of image understanding in visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6325-6334 [DOI: 10.1109/CVPR.2017.670]
Guo Y Y, Nie L Q, Cheng Z Y and Tian Q. 2022. Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view. IEEE Transactions on Image Processing, 31: 227-238 [DOI: 10.1109/TIP.2021.3128322]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Jabri A, Joulin A and van der Maaten L. 2016. Revisiting visual question answering baselines//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer International Publishing: 727-739 [DOI: 10.1007/978-3-319-46484-8_44]
Jia Y P. 2020. Visual Question Answering Model Based on Answer Type Prediction. Harbin: Harbin Institute of Technology
Jing C C, Wu Y W, Zhang X X, Jia Y D and Wu Q. 2020. Overcoming language priors in VQA via decomposed linguistic representations. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7): 11181-11188 [DOI: 10.1609/aaai.v34i07.6776]
Kafle K and Kanan C. 2017. Visual question answering: datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163: 3-20 [DOI: 10.1016/j.cviu.2017.06.005]
Li L J, Gan Z, Cheng Y and Liu J J. 2019. Relation-aware graph attention network for visual question answering//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 10312-10321 [DOI: 10.1109/ICCV.2019.01041]
Liang Z J, Jiang W T, Hu H F and Zhu J Y. 2020. Learning to contrast the counterfactual samples for robust visual question answering//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. [s.l.]: Association for Computational Linguistics: 3285-3292 [DOI: 10.18653/v1/2020.emnlp-main.265]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48].
Lu J S, Yang J W, Batra D and Parikh D. 2016. Hierarchical question-image co-attention for visual question answering//Proceedings of the 30th Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc: 289-297
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space//Proceedings of the 1st International Conference on Learning Representations. Scottsdale, USA: [s.n.]
Pennington J, Socher R and Manning C. 2014. Glove: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: Association for Computational Linguistics: 1532-1543 [DOI: 10.3115/v1/D14-1162].
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P and Clark J. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Si Q Y, Lin Z, Zheng M Y, Fu P and Wang W P. 2021. Check it again: progressive visual question answering via visual entailment//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. [s.l.]: Association for Computational Linguistics: 4101-4110 [DOI: 10.18653/v1/2021.acl-long.317]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: [s.n.]
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Tan H and Bansal M. 2019. LXMERT: learning cross-modality encoder representations from transformers//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong, China: Association for Computational Linguistics: 1-12 [DOI: 10.18653/v1/D19-1514]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc: 1-15
Wu Q, Wang P, Shen C H, Dick A and van den Hengel A. 2016. Ask me anything: free-form visual question answering based on knowledge from external sources//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4622-4630 [DOI: 10.1109/CVPR.2016.500]
Yang Z C, He X D, Gao J F, Deng L and Smola A. 2016. Stacked attention networks for image question answering//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 21-29 [DOI: 10.1109/CVPR.2016.10]
Yu D F, Fu J L, Mei T and Rui Y. 2017. Multi-level attention networks for visual question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4187-4195 [DOI: 10.1109/CVPR.2017.446]
Zhou B L, Tian Y D, Sukhbaatar S, Szlam A and Fergus R. 2015. Simple baseline for visual question answering [EB/OL]. [2021-11-19]. https://arxiv.org/pdf/1512.02167.pdf
Zhu X, Mao Z D, Liu C X, Zhang P, Wang B and Zhang Y D. 2020. Overcoming language priors with self-supervised learning for visual question answering//Proceedings of the 29th International Joint Conference on Artificial Intelligence. Yokohama, Japan: IJCAI.org: 1-7