Aspect-level multimodal co-attention graph convolutional sentiment analysis model
2023, Vol. 28, No. 12, Pages: 3838-3854
Print publication date: 2023-12-16
DOI: 10.11834/jig.221015
Wang Shunjie, Cai Guoyong, Lyu Guangrui, Tang Weibo. 2023. Aspect-level multimodal co-attention graph convolutional sentiment analysis model. Journal of Image and Graphics, 28(12):3838-3854
Objective
Aspect-level multimodal sentiment analysis, which aims to predict the sentiment polarity of a specific aspect mentioned in multimodal data, is receiving growing attention. However, most existing methods insufficiently consider the directional role of aspect words in intra-modal context modeling and in fine-grained cross-modal alignment, which limits the performance of aspect-level multimodal sentiment analysis. To address these problems, an aspect-level multimodal co-attention graph convolutional sentiment analysis model (AMCGC) is proposed to simultaneously model the aspect-oriented intra-modal contextual semantic associations and the fine-grained cross-modal alignment, thereby improving sentiment analysis performance.
Method
To capture aspect-oriented local semantic correlations within each modality, AMCGC uses a self-attention mechanism with orthogonal constraints to generate a semantic graph for each modality. Graph convolution then yields a textual semantic graph representation containing the aspect words and a visual semantic graph representation incorporating the aspect words, and two gated local cross-modal interaction mechanisms operating in opposite directions progressively align the textual and visual semantic graph representations at a fine granularity, narrowing the heterogeneity gap between modalities. Finally, an aspect mask is designed to select the aspect-node features of each modality's graph representation as the sentiment representation, and a cross-modal loss is introduced to reduce the discrepancy between heterogeneous aspect features.
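To make the graph-construction step concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a self-attention layer whose attention matrix serves as a soft semantic adjacency graph, regularized by an orthogonality penalty, followed by one graph-convolution step over that graph. The class names and the exact penalty form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthoSelfAttention(nn.Module):
    """Self-attention that returns (soft semantic graph, orthogonality penalty)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, h):                       # h: (batch, n_nodes, dim)
        scores = self.query(h) @ self.key(h).transpose(1, 2)
        adj = F.softmax(scores / h.size(-1) ** 0.5, dim=-1)  # soft adjacency graph
        # Orthogonal constraint (assumed form): push distinct rows of the
        # attention matrix apart so each unit attends to a distinct local region.
        eye = torch.eye(adj.size(1), device=adj.device)
        penalty = ((adj @ adj.transpose(1, 2) - eye) ** 2).mean()
        return adj, penalty

class GCNLayer(nn.Module):
    """One graph convolution over the learned semantic graph."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h, adj):                  # adj: (batch, n, n)
        return F.relu(adj @ self.proj(h))
```

In training, the returned penalty would simply be added to the task loss with a small weight, which is how an orthogonal constraint is typically enforced.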
Result
Compared with nine methods on two multimodal datasets, the proposed model improves accuracy by 1.76% over the second-best model on the Twitter-2015 dataset and by 1.19% over the second-best model on the Twitter-2017 dataset. Ablation experiments evaluating the orthogonal constraint, the cross-modal loss, and the cross-collaborative multimodal fusion separately verify the rationality of each part of the AMCGC model.
Conclusion
The proposed AMCGC model better captures local semantic correlations within modalities and fine-grained alignment between modalities, improving the accuracy of aspect-level multimodal sentiment analysis.
Objective
The main task of aspect-level multimodal sentiment analysis is to determine the sentiment polarity of a given target (i.e., an aspect or entity) in a sentence by combining the relevant modal data sources; it is thus a fine-grained, target-oriented sentiment analysis task. Traditional sentiment analysis focuses mainly on text data. However, with the growing volume of audio, image, video, and other media data, analyzing text alone is no longer sufficient. Multimodal sentiment analysis surpasses traditional text-only sentiment analysis in understanding human behavior and hence offers greater practical significance and application value. Aspect-level multimodal sentiment analysis (AMSA) has attracted increasing attention for revealing the fine-grained emotions of social media users. Unlike coarse-grained multimodal sentiment analysis, AMSA considers not only the potential correlations between modalities but also the guidance that aspects provide within each modality. However, current AMSA methods do not sufficiently consider the directional effect of aspect words in the context modeling of different modalities or in the fine-grained alignment between modalities. Moreover, the fusion of image and text representations is mostly coarse grained, which leaves the collaborative associations between modalities insufficiently mined and limits the performance of aspect-level multimodal sentiment analysis. To solve these problems, the aspect-level multimodal co-attention graph convolutional sentiment analysis model (AMCGC) is proposed to simultaneously model the aspect-oriented intra-modal contextual semantic associations and the fine-grained cross-modal alignment, thereby improving sentiment analysis performance.
Method
AMCGC is an end-to-end aspect-level sentiment analysis method that involves four stages: input embedding, feature extraction, pairwise graph convolution with cross-modality alternating co-attention, and aspect mask setting. First, after the image and text embedding representations are obtained, a contextual sequence of text features containing the aspect words and a contextual sequence of visual local features incorporating the aspect words are constructed. To explicitly model the directional semantics of the aspect words, position encoding relative to the aspect words is added to both context sequences. The context sequences of the two modalities are then fed into bidirectional long short-term memory networks to capture the context dependencies within each modality. To obtain aspect-oriented local semantic correlations within each modality, a self-attention mechanism with orthogonal constraints is designed to generate a semantic graph for each modality. A textual semantic graph representation containing the aspect words and a visual semantic graph representation incorporating the aspect words are then obtained through a graph convolutional network, which accurately captures the local semantic correlations within each modality. The orthogonal constraint models the local sentiment-semantic relationships of data units within a modality as explicitly as possible and enhances the discriminability of local features. A gated local cross-modality interaction mechanism is further designed to embed the textual semantic graph representation into the visual semantic graph representation. The graph convolutional network is then applied again to learn the local dependencies of the resulting graph representations, and the text-enriched visual semantic graph representation is inversely embedded into the textual semantic graph representation, achieving fine-grained cross-modality association alignment and reducing the heterogeneity gap between modalities. Finally, aspect masks are designed to select the aspect-node features in each modality's semantic graph representation as the final sentiment representation, and a cross-modal loss is introduced to reduce the differences between cross-modal aspect features.
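As a hedged illustration of the remaining components named above, the sketch below shows, under assumed shapes, one direction of the gated local cross-modality interaction, the aspect-mask readout, and a cross-modal loss. The class and function names, and the choice of mean-squared error as the cross-modal loss, are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossModal(nn.Module):
    """One direction of the gated local cross-modal interaction:
    embed source-modality graph features into the target modality."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_tgt, h_src):            # (b, n_t, d), (b, n_s, d)
        attn = F.softmax(h_tgt @ h_src.transpose(1, 2), dim=-1)
        aligned = attn @ h_src                   # source features per target node
        g = torch.sigmoid(self.gate(torch.cat([h_tgt, aligned], dim=-1)))
        return h_tgt + g * aligned               # gated cross-modal injection

def aspect_representation(h, aspect_mask):
    """Aspect-mask readout: average only the aspect-node features
    (aspect_mask is 1 on aspect positions, 0 elsewhere)."""
    m = aspect_mask.unsqueeze(-1).float()        # (b, n, 1)
    return (h * m).sum(1) / m.sum(1).clamp(min=1.0)

def cross_modal_loss(a_text, a_image):
    """Pull the heterogeneous aspect features together (assumed L2 form)."""
    return F.mse_loss(a_text, a_image)
```

Running the interaction twice with the modalities swapped, with another graph convolution in between, would correspond to the progressive two-direction alignment described in the Method.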
Result
The performance of the proposed model is compared with that of nine recent methods on two public multimodal datasets. The accuracy (ACC) of the proposed model is improved by 1.76% and 1.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, compared with the second-best models. Experimental results confirm the advantage of using graph convolutional networks to model, from a local perspective, the interaction and alignment of local semantic relations within modalities, and they highlight the superiority of performing multimodal interaction in a cross-collaborative manner. The model is then subjected to an ablation study covering the orthogonal constraints, the cross-modal loss, the cross-collaborative multimodal fusion, and feature redundancy, with experiments conducted on the Twitter-2015 and Twitter-2017 datasets. All ablation variants perform worse than the full AMCGC model, validating the rationality of each of its parts. The orthogonal constraint has the greatest effect: removing it reduces the ACC of the proposed model by 1.83% and 3.81% on the Twitter-2015 and Twitter-2017 datasets, respectively. In addition, the AMCGC+BERT model, which is based on bidirectional encoder representations from Transformers (BERT) pre-training, outperforms the GloVe-based AMCGC model; its ACC is higher by 1.93% and 2.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, suggesting that large-scale pre-trained models have an advantage in obtaining word representations. The hyperparameters of the model, such as the number of image regions and the weight of the orthogonal constraint term, are set through extensive experiments. Visualization experiments show that the AMCGC model can capture the local semantic correlations within modalities.
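Purely as an illustration of how the ablated terms interact during training, the sketch below combines a task loss with the orthogonality penalty and the cross-modal loss; the weights lambda_orth and lambda_cm are hypothetical stand-ins for the tuned hyperparameters mentioned above, not values from the paper.

```python
def total_loss(task_loss, orth_penalty, cm_loss,
               lambda_orth=0.1, lambda_cm=0.1):
    # Ablating a term corresponds to setting its weight to zero; the Result
    # section reports that zeroing the orthogonal term hurts accuracy most.
    return task_loss + lambda_orth * orth_penalty + lambda_cm * cm_loss
```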
Conclusion
The proposed AMCGC model efficiently captures the local semantic correlations within modalities under the orthogonal constraint. It also effectively achieves fine-grained alignment between modalities and improves the accuracy of aspect-level multimodal sentiment analysis.
Keywords: multimodal sentiment analysis; aspect-level sentiment analysis; graph convolution; self-attention mechanism with orthogonal constraints; cross-modal co-attention; aspect mask