Audio-Visual Attention Networks for Emotion Recognition

Cited by: 6

Authors:

Lee, Jiyoung [1]

Kim, Sunok [1]

Kim, Seungryong [1]

Sohn, Kwanghoon [1]

Affiliations:

[1] Yonsei Univ, Seoul, South Korea

Source:

AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA | 2018

Funding:

National Research Foundation of Singapore;

Keywords:

Multimodal emotion recognition; Spatiotemporal attention; Convolutional Long Short-Term Memory; Recurrent Neural Network;

DOI:

10.1145/3264869.3264873

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in audio-visual video sequences. To learn temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate a temporal attention network using deep neural networks (DNNs). In addition, to learn spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, a spatiotemporal encoder-decoder network is formulated using Convolutional LSTM (ConvLSTM) modules and is learned implicitly, without any pixel-level annotations. Leveraging this spatiotemporal attention, 3D convolutional neural networks (3D-CNNs) are also formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features in an emotion regression model. Experimental results show that our method achieves state-of-the-art results in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
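
The abstract reports results in terms of the concordance correlation coefficient (CCC). As a rough illustration of that metric only, the sketch below follows the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. The function name `concordance_cc` and the toy inputs are assumptions made for this example; this is not the authors' evaluation code.

```python
from math import fsum


def concordance_cc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; uses population (1/n) statistics.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = fsum(pred) / n
    mean_g = fsum(gold) / n
    # Population variances and covariance.
    var_p = fsum((p - mean_p) ** 2 for p in pred) / n
    var_g = fsum((g - mean_g) ** 2 for g in gold) / n
    cov = fsum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # The squared mean difference penalizes systematic offset (bias).
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


if __name__ == "__main__":
    # Toy values: predictions that track the labels but are shifted by +1.
    print(concordance_cc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```

Unlike the Pearson correlation, CCC drops below 1 when predictions are perfectly correlated with the labels but offset or rescaled, which is why the shifted toy example scores well under 1.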


Pages: 27-32

Page count: 6
