Multi-modal and Multi-label Emotion Detection for Comics Based on Two-Stream Network

Cited: 0
Authors
Lin Z. [1 ]
Zeng B. [1 ]
Pan Z. [1 ]
Wen S. [1 ]
Affiliations
[1] School of Computers, Guangdong University of Technology, Guangzhou
Source
Zeng, Bi (zb9215@gdut.edu.cn) | 2021 / Science Press / Vol. 34
Funding
National Natural Science Foundation of China
Keywords
Comic Emotion Detection; Cosine Similarity; Multi-head Self-Attention Mechanism; Multi-modal Fusion;
DOI
10.16451/j.cnki.issn1003-6059.202111005
Abstract
Comics are widely used in social media to metaphorize social phenomena and express emotion. To address the label-ambiguity problem in multi-modal, multi-label emotion detection for comic scenes, a multi-modal and multi-label emotion detection model for comics based on a two-stream network is proposed. The backbone of the method is a two-stream structure: a Transformer model serves as the image backbone network to extract image features, and the RoBERTa pre-trained model serves as the text backbone network to extract text features. Inter-modal information is compared with cosine similarity and combined with a self-attention mechanism to merge the image and text features. An improved cosine similarity, combining a cosine self-attention mechanism with a multi-head self-attention mechanism (COS-MHSA), extracts the high-level features of the image. Finally, the high-level features and the COS-MHSA features are fused. The effectiveness of the proposed method is verified on the EmoRecCom dataset, and the emotion detection results are presented visually. © 2021, Science Press. All rights reserved.
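The abstract names the components but not their equations, so the sketch below is one plausible PyTorch reading, not the authors' implementation: a small ViT-style patch Transformer stands in for the image backbone, HuggingFace's roberta-base for the text backbone, COS-MHSA is rendered as multi-head self-attention with cosine-similarity attention scores and a learnable temperature, and the cross-modal fusion is a simple cosine-similarity gate. All layer sizes, pooling choices, and the gate itself are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel  # text backbone

class CosMHSA(nn.Module):
    """Multi-head self-attention whose attention scores are cosine
    similarities between projected queries and keys (an assumed
    reading of the paper's COS-MHSA)."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.tau = nn.Parameter(torch.tensor(10.0))  # learnable temperature

    def forward(self, x):                              # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1) * self.tau).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

class TwoStreamComicEmotion(nn.Module):
    """Two-stream sketch: a ViT-style Transformer encoder over image
    patches (positional embeddings omitted for brevity), RoBERTa over
    the dialogue text, COS-MHSA on the image tokens, and a cosine-
    similarity gate before fusion. 8 labels as in EmoRecCom."""
    def __init__(self, dim=768, num_labels=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.text_backbone = RobertaModel.from_pretrained("roberta-base")
        self.cos_mhsa = CosMHSA(dim)
        self.classifier = nn.Linear(dim * 2, num_labels)

    def forward(self, images, input_ids, attention_mask):
        # image stream: (B, 3, H, W) -> (B, N, dim) patch tokens
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        img = self.image_encoder(tokens)
        img_hi = self.cos_mhsa(img).mean(dim=1)        # high-level image feature
        # text stream: first (<s>) token as the sentence feature
        txt = self.text_backbone(input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        txt_feat = txt[:, 0]
        # weight the text stream by its cosine agreement with the image stream
        gate = F.cosine_similarity(img_hi, txt_feat, dim=-1).unsqueeze(-1)
        fused = torch.cat([img_hi, gate * txt_feat], dim=-1)
        return self.classifier(fused)                  # multi-label logits
```

Since the task is multi-label, the logits would be trained with BCEWithLogitsLoss and thresholded per label at inference rather than passed through a softmax.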
Pages: 1017-1027
Number of pages: 10