Attention-enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Times cited: 35
Authors
Zhao, Ziping [1, 2]
Bao, Zhongtian [1]
Zhang, Zixing [3]
Cummins, Nicholas [2]
Wang, Haishuai [1]
Schuller, Bjorn W. [2, 3]
Affiliations
[1] Tianjin Normal Univ, Coll Comp & Informat Engn, Tianjin, Peoples R China
[2] Univ Augsburg, ZDB Chair Embedded Intelligence Hlth Care & Wellb, Augsburg, Germany
[3] Imperial Coll London, GLAM Grp Language Audio & Mus, London, England
Funding
National Natural Science Foundation of China
Keywords
speech emotion recognition; connectionist temporal classification; attention mechanism; bidirectional LSTM
DOI
10.21437/Interspeech.2019-1649
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach, however, is limited, in that it can result in models that do not capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to overcome this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirectional long short-term memory (BLSTM) neural network in combination with a connectionist temporal classification (CTC) objective function (Attention-BLSTM-CTC) for SER. We also assessed the benefits of incorporating two contemporary attention mechanisms, namely component attention and quantum attention, into the CTC framework. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for SER. We demonstrated the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpora. The experimental results demonstrate that our proposed model outperforms current state-of-the-art approaches.
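To make the sequence-to-sequence framing in the abstract concrete, the following is a minimal sketch of the standard CTC forward computation in plain Python: it sums the probability of every frame-level alignment that collapses to a given label sequence. It is not the authors' implementation. The function name, the toy class set (blank plus one emotion label), and the uniform posteriors are assumptions made purely for illustration; in a real model the per-frame posteriors would come from a network such as the attention-based BLSTM the abstract describes.

```python
import math

def ctc_sequence_probability(frame_probs, target, blank=0):
    """Total probability of all frame-level alignments that collapse to `target`.

    frame_probs: list of T rows, each a list of per-class probabilities for one frame.
    target: list of non-blank label indices (usually much shorter than T).
    """
    # Interleave blanks around the labels: [a, b] -> [blank, a, blank, b, blank]
    extended = [blank]
    for label in target:
        extended += [label, blank]
    S, T = len(extended), len(frame_probs)

    # alpha[s]: probability mass of partial alignments ending at extended[s]
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][extended[0]]
    if S > 1:
        alpha[1] = frame_probs[0][extended[1]]

    for t in range(1, T):
        new_alpha = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            # Skipping the blank is allowed only between two different labels
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame_probs[t][extended[s]]
        alpha = new_alpha

    # Complete alignments end on the last label or on the trailing blank
    prob = alpha[-1]
    if S > 1:
        prob += alpha[-2]
    return prob

# Hypothetical toy input: 2 frames, classes = [blank, "angry"], uniform posteriors.
toy_frames = [[0.5, 0.5], [0.5, 0.5]]
# Three of the four possible alignments collapse to ["angry"], so this is 0.75.
toy_prob = ctc_sequence_probability(toy_frames, [1])
toy_loss = -math.log(toy_prob)  # CTC loss is the negative log of this probability
```

Because the objective marginalizes over alignments, the model is free to emit the emotion label on only the frames it finds informative and blank elsewhere, which is the property that distinguishes this setup from a plain sequence-to-label classifier.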
Pages: 206-210
Number of pages: 5