Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech

Cited by: 0
Authors
Almotlak, Hussam [1 ]
Weber, Cornelius [1 ]
Qu, Leyuan [1 ]
Wermter, Stefan [1 ]
Affiliations
[1] Univ Hamburg, Dept Informat, Knowledge Technol, Hamburg, Germany
Keywords
Unsupervised learning; Feature extraction; Variational autoencoders; VAE with auxiliary variables; Multi-timescale neural network; Speaker identification; Emotion recognition
DOI
10.1007/978-3-030-61609-0_42
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Unsupervised learning is based on the idea of self-organization: finding hidden patterns and features in the data without the need for labels. Variational autoencoders (VAEs) are generative unsupervised learning models that create low-dimensional representations of the input data and learn by regenerating the input from that representation. Recently, VAEs have been used to extract representations from audio data, which carry not only content-dependent information but also speaker-dependent information such as gender, health status, and speaker ID. VAEs with two timescale variables were then introduced to disentangle these two kinds of information from each other. Our approach introduces a third, medium timescale into the VAE: instead of having only a global and a local timescale variable, the model holds a global, a medium, and a local variable. We tested the model on three downstream applications: speaker identification, gender classification, and emotion recognition, where each latent representation performed better on some tasks than the others. Speaker ID and gender were best predicted from the global variable, while emotion was best extracted from the medium one. Our model achieves excellent results, exceeding state-of-the-art models on speaker identification and emotion regression from audio.
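The core idea of the abstract, latent variables that summarize the signal at three temporal granularities, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name, the fixed segment length, and the use of simple mean pooling are all assumptions chosen to show how one utterance yields per-frame (local), per-segment (medium), and per-utterance (global) representations.

```python
import numpy as np

def multi_timescale_pooling(frames, medium_win=50):
    """Illustrative sketch (not the paper's model): summarize frame-level
    features at three timescales, mirroring a VAE whose latent variables
    capture local (per-frame), medium (per-segment), and global
    (per-utterance) information.

    frames: array of shape [T, D] (T frames, D feature dims).
    medium_win: assumed fixed segment length for the medium timescale.
    """
    T, D = frames.shape
    # Local: one vector per frame (unchanged granularity).
    local = frames
    # Medium: average over fixed-length segments of `medium_win` frames.
    n_seg = int(np.ceil(T / medium_win))
    medium = np.stack([frames[i * medium_win:(i + 1) * medium_win].mean(axis=0)
                       for i in range(n_seg)])
    # Global: a single vector summarizing the whole utterance.
    global_ = frames.mean(axis=0, keepdims=True)
    return local, medium, global_

x = np.random.randn(200, 64)  # 200 frames of 64-dim features
loc, med, glob = multi_timescale_pooling(x)
print(loc.shape, med.shape, glob.shape)  # (200, 64) (4, 64) (1, 64)
```

In the actual model each timescale is a stochastic latent variable trained with the VAE objective rather than a deterministic mean, but the shapes above convey why the global variable suits speaker ID and gender while the medium variable suits emotion, which varies over seconds rather than frames or whole recordings.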
Pages: 529 - 540
Page count: 12
Related Papers
50 records
  • [1] Autoencoder With Emotion Embedding for Speech Emotion Recognition
    Zhang, Chenghao
    Xue, Lei
    [J]. IEEE ACCESS, 2021, 9 : 51231 - 51241
  • [2] Disentangled Variational Autoencoder for Emotion Recognition in Conversations
    Yang, Kailai
    Zhang, Tianlin
    Ananiadou, Sophia
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (02) : 508 - 518
  • [3] Bimodal Variational Autoencoder for Audiovisual Speech Recognition
    Sayed, Hadeer M.
    ElDeeb, Hesham E.
    Taie, Shereen A.
    [J]. MACHINE LEARNING, 2023, 112 (04) : 1201 - 1226
  • [4] Speech Emotion Recognition 'in the wild' Using an Autoencoder
    Dissanayake, Vipula
    Zhang, Haimo
    Billinghurst, Mark
    Nanayakkara, Suranga
    [J]. INTERSPEECH 2020, 2020, : 526 - 530
  • [5] Speech Emotion Recognition in Persian Based on Stacked Autoencoder by Comparing Local and Global Features
    Bastanfard, Azam
    Abbasian, Alireza
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (23) : 36413 - 36430
  • [6] A Vector Quantized Masked Autoencoder for Speech Emotion Recognition
    Sadok, Samir
    Leglaive, Simon
    Seguier, Renaud
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023
  • [7] Sparse Autoencoder with Attention Mechanism for Speech Emotion Recognition
    Sun, Ting-Wei
    Wu, An-Yeu
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE CIRCUITS AND SYSTEMS (AICAS 2019), 2019, : 146 - 149