Temporal conditional Wasserstein GANs for audio-visual affect-related ties

Cited: 0
Authors
Athanasiadis, Christos [1 ]
Hortal, Enrique [1 ]
Asteriadis, Stelios [1 ]
Affiliations
[1] Maastricht Univ, Maastricht, Netherlands
Keywords
Domain Adaptation; Audio Emotion Recognition; Generative Adversarial Networks; Attention Mechanisms
DOI
10.1109/ACIIW52867.2021.9666277
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition through audio is a challenging task that entails proper feature extraction and classification. State-of-the-art classification strategies are usually based on deep learning architectures, and training such complex networks normally requires very large audio-visual corpora with emotion annotations. However, such availability is not always guaranteed, since harvesting and annotating these datasets is time-consuming. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from the face modality. Taking as input temporal facial features extracted with a dynamic deep learning architecture (based on 3dCNN, LSTM and Transformer networks), together with conditional information derived from the annotations, our system generates realistic spectrograms representing audio clips that correspond to a specific emotional context. To validate the generated samples, beyond three quality metrics (Fréchet Inception Distance, Inception Score and Structural Similarity index), we applied an audio-based emotion recognition scheme to them. When the generated samples are fused with the original real ones, an improvement of between 3.5% and 5.5% in audio emotion recognition performance is achieved on two state-of-the-art datasets.
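The abstract cites the Fréchet Inception Distance as one of the quality metrics for the generated spectrograms. As a rough illustration only (not the authors' implementation), the Fréchet distance between two sets of feature embeddings reduces to a closed form over their means and covariances; the `frechet_distance` helper and the synthetic feature arrays below are assumptions for the sketch:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real: np.ndarray, feat_fake: np.ndarray) -> float:
    """Fréchet distance between two sets of feature embeddings.

    FID = ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2}),
    where mu/S are the mean and covariance of each embedding set.
    """
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    s_r = np.cov(feat_real, rowvar=False)
    s_f = np.cov(feat_fake, rowvar=False)
    # Matrix square root of the covariance product; sqrtm may return a
    # matrix with tiny imaginary parts due to numerical error.
    covmean = linalg.sqrtm(s_r @ s_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(s_r + s_f - 2.0 * covmean))

# Illustrative check: identical sets give ~0; a mean shift raises the score.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
print(frechet_distance(real, real))        # ~0 for identical embeddings
print(frechet_distance(real, real + 5.0))  # grows with the mean shift
```

In practice the embeddings would come from an Inception-style network applied to the real and generated spectrograms, rather than from random arrays as here.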
Pages: 8