Audio-Visual Attention Networks for Emotion Recognition

Cited by: 6
Authors
Lee, Jiyoung [1 ]
Kim, Sunok [1 ]
Kim, Seungryong [1 ]
Sohn, Kwanghoon [1 ]
Affiliations
[1] Yonsei Univ, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
Multimodal emotion recognition; Spatiotemporal attention; Convolutional Long Short-Term Memory; Recurrent Neural Network;
DOI
10.1145/3264869.3264873
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in audio-visual video sequences. To learn temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate a temporal attention network using deep neural networks (DNNs). In addition, to learn spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, a spatiotemporal encoder-decoder network is formulated using Convolutional LSTM (ConvLSTM) modules and is learned implicitly without any pixel-level annotations. By leveraging the spatiotemporal attention, a 3D convolutional neural network (3D-CNN) is also formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features in an emotion regression model. Experimental results show that our method achieves state-of-the-art performance in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
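The attention mechanism described in the abstract can be illustrated with a minimal NumPy sketch: temporal attention weights each audio frame by a learned salience score, spatial attention weights each location in the video feature maps, and the attended descriptors are fused for regression. This is a simplified illustration only, with assumed shapes and random vectors standing in for learned parameters; the paper's actual model uses DNN, ConvLSTM, and 3D-CNN components, not this toy computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(audio_feats, w):
    # audio_feats: (T, D) frame-level audio features; w: (D,) scoring vector
    # (assumed learned; random here for illustration)
    scores = audio_feats @ w              # (T,) salience score per frame
    alpha = softmax(scores)               # attention weights sum to 1 over time
    return alpha @ audio_feats, alpha     # (D,) attended audio descriptor

def spatial_attention(video_feats, v):
    # video_feats: (T, H, W, C) conv feature maps; v: (C,) scoring vector
    T, H, W, C = video_feats.shape
    flat = video_feats.reshape(T, H * W, C)
    beta = softmax(flat @ v, axis=1)      # (T, H*W) spatial attention per frame
    attended = np.einsum('ts,tsc->tc', beta, flat)  # (T, C) attended features
    return attended.mean(axis=0), beta.reshape(T, H, W)

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 64))          # 50 audio frames, 64-dim features
video = rng.standard_normal((16, 7, 7, 128))   # 16 frames of 7x7 feature maps

a_vec, alpha = temporal_attention(audio, rng.standard_normal(64))
v_vec, beta = spatial_attention(video, rng.standard_normal(128))
fused = np.concatenate([a_vec, v_vec])         # joint feature for emotion regression
print(fused.shape)                             # (192,)
```

The softmax normalization makes each attention map a distribution, so salient frames and regions contribute proportionally more to the pooled descriptor, mirroring the selective-focus idea in the paper.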
Pages: 27-32
Page count: 6
Related Papers
50 in total (items [41]-[50] shown)
  • [41] Leveraging recent advances in deep learning for audio-Visual emotion recognition
    Schoneveld, Liam
    Othmani, Alice
    Abdelkawy, Hazem
    [J]. PATTERN RECOGNITION LETTERS, 2021, 146 : 1 - 7
  • [42] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
    Ma, Fei
    Zhang, Wei
    Li, Yang
    Huang, Shao-Lun
    Zhang, Lin
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (20): : 1 - 23
  • [43] Audio-visual affect recognition
    Zeng, Zhihong
    Tu, Jilin
    Liu, Ming
    Huang, Thomas S.
    Pianfetti, Brian
    Roth, Dan
    Levinson, Stephen
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2007, 9 (02) : 424 - 428
  • [44] Audio-visual integration of emotion expression
    Collignon, Olivier
    Girard, Simon
    Gosselin, Frederic
    Roy, Sylvain
    Saint-Amour, Dave
    Lassonde, Maryse
    Lepore, Franco
    [J]. BRAIN RESEARCH, 2008, 1242 : 126 - 135
  • [45] Audio-visual gender recognition
    Liu, Ming
    Xu, Xun
    Huang, Thomas S.
    [J]. MIPPR 2007: PATTERN RECOGNITION AND COMPUTER VISION, 2007, 6788
  • [46] Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition
    Hsu, Jia-Hao
    Wu, Chung-Hsien
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (04) : 3231 - 3243
  • [47] Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
    Praveen, R. Gnana
    Cardinal, Patrick
    Granger, Eric
    [J]. IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, 2023, 5 (03): : 360 - 373
  • [48] MODALITY ATTENTION FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION
    Zhou, Pan
    Yang, Wenwen
    Chen, Wei
    Wang, Yanfeng
    Jia, Jia
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6565 - 6569
  • [49] AUDIO-VISUAL FUSION AND CONDITIONING WITH NEURAL NETWORKS FOR EVENT RECOGNITION
    Brousmiche, Mathilde
    Rouat, Jean
    Dupont, Stephane
    [J]. 2019 IEEE 29TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2019,
  • [50] An audio-visual speech recognition system for testing new audio-visual databases
    Pao, Tsang-Long
    Liao, Wen-Yuan
    [J]. VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +