Temporal Relation Inference Network for Multimodal Speech Emotion Recognition

Cited by: 11
Authors
Dong, Guan-Nan [1]
Pun, Chi-Man [1]
Zhang, Zheng [1,2]
Affiliations
[1] Univ Macau, Dept Comp & Informat Sci, Macau, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 150001, Peoples R China
Keywords
Feature extraction; Emotion recognition; Speech recognition; Cognition; Hidden Markov models; Correlation; Task analysis; Speech emotion recognition; multi-modal learning; temporal learning; relation inference network; SENTIMENT ANALYSIS; MODEL; FEATURES
DOI
10.1109/TCSVT.2022.3163445
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Speech emotion recognition (SER) is a non-trivial task for humans, and it remains challenging for automatic systems due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems have treated multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and the temporal relations of speech both encode meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) for multi-modal SER that fully considers the underlying hierarchy of phonetic structure and its associations across modalities under sequential temporal guidance. Specifically, we design a temporal reasoning calibration module to imitate rich, realistic contextual conditions. Unlike previous works, which assume all modalities are related, it infers the dependency relationships among semantic information at the temporal level and learns to handle the multi-modal interaction sequence in a flexible order. To enhance the feature representation, an innovative temporal attentive fusion unit is developed to magnify the details embedded in each single modality at the semantic level. Meanwhile, an adaptive feature fusion mechanism aggregates representations from both the temporal and semantic levels, selectively collecting implicit complementary information to strengthen the dependencies between different information subspaces and maximize the integrity of the fused representation. Extensive experiments on two benchmark datasets demonstrate the superiority of TRIN over state-of-the-art SER methods.
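To make the fusion ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation (the record includes no code), of a temporal attentive fusion unit paired with an adaptive gated fusion: each modality's frame sequence is re-weighted by temporal self-attention, and a learned gate decides, per time step, how much each modality contributes to the fused representation. All names and shapes (TemporalAttentiveFusion, d_model, the aligned 50-step sequences) are illustrative assumptions.

# Minimal sketch of temporal attentive fusion with an adaptive gate.
# Hypothetical names/shapes; not the TRIN authors' released code.
import torch
import torch.nn as nn

class TemporalAttentiveFusion(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        # Per-modality temporal self-attention (magnifies intra-modality detail).
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Adaptive gate: per-time-step mixing weights for the two modalities.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, seq_len, d_model), assumed aligned in time.
        a, _ = self.audio_attn(audio, audio, audio)
        t, _ = self.text_attn(text, text, text)
        g = self.gate(torch.cat([a, t], dim=-1))  # values in [0, 1]
        return g * a + (1.0 - g) * t              # gated complementary fusion

if __name__ == "__main__":
    fuse = TemporalAttentiveFusion(d_model=128)
    audio = torch.randn(2, 50, 128)  # e.g. 50 acoustic frames
    text = torch.randn(2, 50, 128)   # e.g. 50 token embeddings
    print(fuse(audio, text).shape)   # torch.Size([2, 50, 128])

The sigmoid gate is one simple way to realize the "selectively collect implicit complementary information" idea: it convexly mixes the two modality streams so that neither is assumed to dominate at every time step.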
Pages: 6472-6485
Number of pages: 14