A Waveform-Feature Dual Branch Acoustic Embedding Network for Emotion Recognition

Cited: 3
|
Authors
Li, Jeng-Lin [1 ,2 ]
Huang, Tzu-Yun [1 ,2 ]
Chang, Chun-Min [1 ,2 ]
Lee, Chi-Chun [1 ,2 ]
Affiliations
[1] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu, Taiwan
[2] MOST Joint Res Ctr AI Technol & All Vista Hlthca, Taipei, Taiwan
Source
FRONTIERS IN COMPUTER SCIENCE | 2020 / Vol. 2
Keywords
speech emotion recognition; raw waveform; end-to-end; complementary learning; acoustic representation; SPEECH; IDENTIFICATION; MODEL;
DOI
10.3389/fcomp.2020.00013
Chinese Library Classification
TP39 [Computer Applications];
Discipline Classification Codes
081203 ; 0835 ;
Abstract
Research in speech emotion recognition (SER) has attracted considerable attention due to its critical role in the scientific understanding of human behavior and its broad range of commercial applications. Conventionally, SER relies heavily on hand-crafted acoustic features. Recent progress in deep learning has attempted to model emotion directly from the raw waveform in an end-to-end learning scheme; however, this approach generally remains sub-optimal. An alternative direction is to enhance and augment the knowledge-based acoustic representation with an affect-related representation derived directly from the raw waveform. Here, we propose a complementary waveform-feature dual-branch learning network, termed the Dual-Complementary Acoustic Embedding Network (DCaEN), to effectively integrate psychoacoustic knowledge and a raw waveform embedding within an augmented feature-space learning approach. DCaEN contains an acoustic feature embedding network and a raw waveform network, which are learned jointly by integrating a negative cosine distance constraint into the loss function. The experimental results show that DCaEN achieves 59.31 and 46.73% unweighted average recall (UAR) on the USC IEMOCAP and MSP-IMPROV speech emotion databases, respectively, improving over models that use either hand-crafted acoustic features or the raw waveform alone, as well as over training without this loss constraint. Further analysis illustrates a reverse mirroring pattern in the learned latent space, demonstrating the complementary nature of DCaEN feature-space learning.
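The abstract describes training the two branches with a negative cosine distance constraint so that the hand-crafted-feature embedding and the raw-waveform embedding capture complementary information. A minimal numpy sketch of such a combined objective is given below; the weighting term `alpha` and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dcaen_loss(classification_loss, feat_emb, wave_emb, alpha=0.5):
    """Hypothetical combined objective: the emotion classification loss
    plus an alpha-weighted cosine term between the feature-branch and
    waveform-branch embeddings. Minimizing the cosine similarity pushes
    the two embeddings apart, encouraging complementary representations."""
    return classification_loss + alpha * cosine_similarity(feat_emb, wave_emb)
```

Under this formulation, orthogonal branch embeddings contribute nothing to the penalty, while aligned (redundant) embeddings increase the loss, which is one plausible reading of the "negative cosine distance constraint" named in the abstract.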
Pages: 13
Related Papers (50 total)
  • [1] A Dual-Complementary Acoustic Embedding Network Learned from Raw Waveform for Speech Emotion Recognition
    Huang, Tzu-Yun
    Li, Jeng-Lin
    Chang, Chun-Min
    Lee, Chi-Chun
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [2] A Dual-Branch Network With Feature Assistance for Automatic Modulation Recognition
    Feng, Yuhang
    Duan, Ruifeng
    Li, Shurui
    Cheng, Peng
    Liu, Wanchun
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 701 - 705
  • [3] A Dual-Branch Dynamic Graph Convolution Based Adaptive TransFormer Feature Fusion Network for EEG Emotion Recognition
    Sun, Mingyi
    Cui, Weigang
    Yu, Shuyue
    Han, Hongbin
    Hu, Bin
    Li, Yang
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (04) : 2218 - 2228
  • [4] Dual-Branch Multimodal Fusion Network for Driver Facial Emotion Recognition
    Wang, Le
    Chang, Yuchen
    Wang, Kaiping
    APPLIED SCIENCES-BASEL, 2024, 14 (20):
  • [5] Speech Emotion Recognition Using Speech Feature and Word Embedding
    Atmaja, Bagus Tris
    Shirai, Kiyoaki
    Akagi, Masato
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 519 - 523
  • [6] Attention Learning with Retrievable Acoustic Embedding of Personality for Emotion Recognition
    Li, Jeng-Lin
    Lee, Chi-Chun
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [7] Hierarchical Network with Label Embedding for Contextual Emotion Recognition
    Deng, Jiawen
    Ren, Fuji
    RESEARCH, 2021, 2021
  • [8] Acoustic feature analysis and optimization for Bangla speech emotion recognition
    Sultana, Sadia
    Rahman, Mohammad Shahidur
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2023, 44 (03) : 157 - 166
  • [9] Speech Emotion Recognition Based on Multi Acoustic Feature Fusion
    Xiang, Shanshan
    Anwer, Sadiyagul
    Yilahun, Hankiz
    Hamdulla, Askar
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 338 - 346
  • [10] Acoustic feature selection for automatic emotion recognition from speech
    Rong, Jia
    Li, Gang
    Chen, Yi-Ping Phoebe
    INFORMATION PROCESSING & MANAGEMENT, 2009, 45 (03) : 315 - 328