A Waveform-Feature Dual Branch Acoustic Embedding Network for Emotion Recognition

Cited: 3
|
Authors
Li, Jeng-Lin [1 ,2 ]
Huang, Tzu-Yun [1 ,2 ]
Chang, Chun-Min [1 ,2 ]
Lee, Chi-Chun [1 ,2 ]
Affiliations
[1] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu, Taiwan
[2] MOST Joint Res Ctr AI Technol & All Vista Hlthca, Taipei, Taiwan
Source
FRONTIERS IN COMPUTER SCIENCE | 2020 / Vol. 2
Keywords
speech emotion recognition; raw waveform; end-to-end; complementary learning; acoustic representation; SPEECH; IDENTIFICATION; MODEL;
DOI
10.3389/fcomp.2020.00013
Chinese Library Classification
TP39 [Computer Applications];
Discipline Classification Codes
081203 ; 0835 ;
Abstract
Research in speech emotion recognition (SER) has attracted considerable attention due to its critical role in the scientific understanding of human behavior and its broad range of commercial applications. Conventionally, SER relies heavily on hand-crafted acoustic features. Recent progress in deep learning has attempted to model emotion directly from the raw waveform in an end-to-end learning scheme; however, this approach generally remains sub-optimal. An alternative direction is to enhance and augment the knowledge-based acoustic representation with an affect-related representation derived directly from the raw waveform. Here, we propose a complementary waveform-feature dual-branch learning network, termed the Dual-Complementary Acoustic Embedding Network (DCaEN), to effectively integrate psychoacoustic knowledge and a raw waveform embedding within an augmented feature-space learning approach. DCaEN contains an acoustic feature embedding network and a raw waveform network, which are learned jointly by integrating a negative cosine distance constraint into the loss function. The experimental results show that DCaEN achieves 59.31 and 46.73% unweighted average recall (UAR) on the USC IEMOCAP and MSP-IMPROV speech emotion databases, respectively, improving over models that use either hand-crafted acoustic features or the raw waveform alone, as well as over training without this loss constraint. Further analysis illustrates a reverse mirroring pattern in the learned latent space, demonstrating the complementary nature of DCaEN feature-space learning.
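The abstract describes training the two branches with a negative cosine distance constraint so that the hand-crafted-feature embedding and the raw-waveform embedding capture complementary information. A minimal numpy sketch of such a combined objective is given below; the weighting term `alpha` and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dcaen_loss(classification_loss, feat_emb, wave_emb, alpha=0.5):
    """Hypothetical combined objective: the emotion classification loss
    plus an alpha-weighted cosine term between the feature-branch and
    waveform-branch embeddings. Minimizing the cosine similarity pushes
    the two embeddings apart, encouraging complementary representations."""
    return classification_loss + alpha * cosine_similarity(feat_emb, wave_emb)
```

Under this formulation, orthogonal branch embeddings contribute nothing to the penalty, while aligned (redundant) embeddings increase the loss, which is one plausible reading of the "negative cosine distance constraint" named in the abstract.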
Pages: 13
Related Papers (50 total)
  • [1] A Dual-Complementary Acoustic Embedding Network Learned from Raw Waveform for Speech Emotion Recognition
    Huang, Tzu-Yun
    Li, Jeng-Lin
    Chang, Chun-Min
    Lee, Chi-Chun
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [2] A Dual-Branch Network With Feature Assistance for Automatic Modulation Recognition
    Feng, Yuhang
    Duan, Ruifeng
    Li, Shurui
    Cheng, Peng
    Liu, Wanchun
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 701 - 705
  • [3] A Dual-Branch Dynamic Graph Convolution Based Adaptive TransFormer Feature Fusion Network for EEG Emotion Recognition
    Sun, Mingyi
    Cui, Weigang
    Yu, Shuyue
    Han, Hongbin
    Hu, Bin
    Li, Yang
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (04) : 2218 - 2228
  • [4] Dual-Branch Multimodal Fusion Network for Driver Facial Emotion Recognition
    Wang, Le
    Chang, Yuchen
    Wang, Kaiping
    APPLIED SCIENCES-BASEL, 2024, 14 (20):
  • [5] Speech Emotion Recognition Using Speech Feature and Word Embedding
    Atmaja, Bagus Tris
    Shirai, Kiyoaki
    Akagi, Masato
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 519 - 523
  • [6] Attention Learning with Retrievable Acoustic Embedding of Personality for Emotion Recognition
    Li, Jeng-Lin
    Lee, Chi-Chun
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [7] Hierarchical Network with Label Embedding for Contextual Emotion Recognition
    Deng, Jiawen
    Ren, Fuji
    RESEARCH, 2021, 2021
  • [8] Acoustic feature analysis and optimization for Bangla speech emotion recognition
    Sultana, Sadia
    Rahman, Mohammad Shahidur
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2023, 44 (03) : 157 - 166
  • [9] Speech Emotion Recognition Based on Multi Acoustic Feature Fusion
    Xiang, Shanshan
    Anwer, Sadiyagul
    Yilahun, Hankiz
    Hamdulla, Askar
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 338 - 346
  • [10] Acoustic feature selection for automatic emotion recognition from speech
    Rong, Jia
    Li, Gang
    Chen, Yi-Ping Phoebe
    INFORMATION PROCESSING & MANAGEMENT, 2009, 45 (03) : 315 - 328