Joint modelling of audio-visual cues using attention mechanisms for emotion recognition

Cited by: 0
Authors
Esam Ghaleb
Jan Niehues
Stylianos Asteriadis
Affiliations
[1] Maastricht University, Department of Data Science and Knowledge Engineering
Keywords
Affective computing; Emotion recognition; Multimodal learning; Attention mechanisms
Abstract
Emotions, with their complex socio-psychological nature, play a crucial role in human-human communication. To enhance emotion communication in human-computer interaction, this paper studies emotion recognition from the audio and visual signals of video clips, utilizing facial expressions and vocal utterances. The study aims to exploit the temporal information of audio-visual cues and to detect their most informative time segments, using attention mechanisms to weight the importance of each modality over time. We propose a novel framework built on bi-modal time windows spanning short video clips labeled with discrete emotions. The framework employs two networks, each dedicated to one modality: the input to each modality-specific network is a time-dependent signal derived from the embeddings of the video or audio stream, and a Transformer encoder is applied to the visual embeddings and another to the audio embeddings. The paper presents detailed studies and meta-analysis findings that link the outputs of our proposition to research from psychology. Specifically, it provides a framework for understanding the underlying principles of emotion recognition under three modality setups: audio only, video only, and the fusion of audio and video. Experimental results on two datasets show that the proposed framework improves emotion recognition accuracy over state-of-the-art techniques and over baseline methods without attention mechanisms, outperforming the baselines by at least 5.4%. Our experiments show that attention mechanisms reduce the gap between the entropies of the unimodal predictions, which increases the certainty of the bimodal predictions and therefore improves the bimodal recognition rates. Furthermore, we evaluate the framework with noisy data in different training and testing scenarios to check its consistency and the behavior of the attention mechanisms; the results demonstrate that attention mechanisms increase the framework's robustness when it is exposed to similar conditions during the training and testing phases. Finally, we present a comprehensive evaluation of emotion recognition as a function of time, which shows that the middle time segments of a video clip are essential for the audio modality, whereas for the video modality the importance is distributed evenly across time windows.
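To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' released code: one Transformer encoder per modality over time-windowed embeddings, scalar temporal attention for pooling, concatenation fusion, and a discrete-emotion classifier. The 256-dimensional embeddings, two encoder layers, four heads, six emotion classes, and the fusion scheme are all illustrative assumptions rather than details taken from the paper; the prediction_entropy helper mirrors the entropy comparison mentioned in the abstract.

```python
# Minimal sketch of a bimodal attention framework (assumed design,
# not the paper's exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Transformer encoder over a sequence of time-window embeddings."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # One scalar attention score per time window (assumed pooling).
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, windows, dim)
        h = self.encoder(x)                      # contextualized windows
        w = torch.softmax(self.score(h), dim=1)  # temporal attention weights
        return (w * h).sum(dim=1), w.squeeze(-1) # pooled vector, weights

class BimodalEmotionNet(nn.Module):
    """Two modality-specific networks fused by concatenation (assumed)."""
    def __init__(self, dim=256, n_emotions=6):
        super().__init__()
        self.audio_net = ModalityEncoder(dim)
        self.video_net = ModalityEncoder(dim)
        self.classifier = nn.Linear(2 * dim, n_emotions)

    def forward(self, audio_emb, video_emb):
        a, a_w = self.audio_net(audio_emb)
        v, v_w = self.video_net(video_emb)
        logits = self.classifier(torch.cat([a, v], dim=-1))
        return logits, a_w, v_w

def prediction_entropy(logits):
    """Shannon entropy (nats) of the softmax prediction distribution."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

# Toy usage: 2 clips, 8 time windows of 256-d audio/video embeddings each.
audio = torch.randn(2, 8, 256)
video = torch.randn(2, 8, 256)
model = BimodalEmotionNet()
logits, a_w, v_w = model(audio, video)
print(logits.shape, prediction_entropy(logits))
```

The returned temporal weights (a_w, v_w) expose which time windows each modality attends to, which is how a per-window importance analysis like the one reported in the abstract could be carried out.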
Pages: 11239–11264 (25 pages)