Temporal conditional Wasserstein GANs for audio-visual affect-related ties

Citations: 0

Authors:

Athanasiadis, Christos [1]

Hortal, Enrique [1]

Asteriadis, Stelios [1]

Affiliations:

[1] Maastricht University, Maastricht, Netherlands

Source:

2021 9TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW) | 2021

Keywords:

Domain Adaptation; Audio Emotion Recognition; Generative Adversarial Networks; Attention Mechanisms;

DOI:

10.1109/ACIIW52867.2021.9666277

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Emotion recognition from audio is a challenging task that requires appropriate feature extraction and classification. State-of-the-art classification strategies are usually based on deep learning architectures, and training complex deep learning networks normally requires very large audiovisual corpora with emotion annotations. Such corpora are not always available, since harvesting and annotating them is time-consuming. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from a face modality. Taking as input temporal facial features extracted using a dynamic deep learning architecture (based on 3D CNN, LSTM and Transformer networks) and, additionally, conditional information related to annotations, our system generates realistic spectrograms that represent audio clips corresponding to a specific emotional context. As proof of their validity, apart from three quality metrics (Fréchet Inception Distance, Inception Score and Structural Similarity index), we verified the generated samples by applying an audio-based emotion recognition scheme. When the generated samples are fused with the initial real ones, an improvement of between 3.5% and 5.5% was achieved in audio emotion recognition performance for two state-of-the-art datasets.
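The conditional Wasserstein objective named in the abstract can be illustrated with a minimal sketch. It shows only the generic Wasserstein critic and generator losses and one common way of appending a one-hot emotion label to an input feature vector. All function names here (`critic_loss`, `generator_loss`, `condition_input`) are hypothetical, and the abstract does not specify the paper's network architecture, conditioning scheme, or Lipschitz constraint (for example weight clipping or a gradient penalty), so none of these are reproduced.

```python
from statistics import fmean
from typing import List, Sequence


def critic_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Generic Wasserstein critic loss: mean score on generated samples
    minus mean score on real samples. Minimising it pushes real scores up
    and generated scores down. The Lipschitz constraint is omitted here."""
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Generic Wasserstein generator loss: the generator tries to raise
    the critic's score on generated samples."""
    return -fmean(fake_scores)


def condition_input(face_features: Sequence[float],
                    emotion_label: int,
                    num_classes: int) -> List[float]:
    """Assumed conditioning scheme (not taken from the paper): concatenate
    a one-hot emotion label to the facial feature vector."""
    one_hot = [1.0 if i == emotion_label else 0.0 for i in range(num_classes)]
    return list(face_features) + one_hot
```

In a full training loop these scalar losses would be computed on critic outputs for batches of real and generated spectrograms; the sketch uses plain Python lists only to keep the arithmetic visible.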


Pages: 8
