Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition

Authors
Masoud Geravanchizadeh
Elnaz Forouhandeh
Meysam Bashirpour
Affiliation
[1] University of Tabriz, Faculty of Electrical and Computer Engineering
Keywords
Emotion-affected speech recognition; Vocal tract length normalization; Frequency warping; Acoustic feature normalization
Abstract
The performance of speech recognition systems trained on neutral utterances degrades significantly when these systems are tested on emotional speech. Because speakers routinely express emotion in real-world environments, an automatic speech recognition system must account for the emotional state of the speech it processes. Little work has been done on emotion-affected speech recognition; most research so far has focused on classifying speech emotions. In this paper, vocal tract length normalization is employed to make an emotion-affected speech recognition system more robust. Two recognizer structures are used: hybrids of the hidden Markov model with a Gaussian mixture model and with a deep neural network. Frequency warping is applied in the filterbank and/or discrete cosine transform domain(s) during feature extraction, for several acoustic features, so that the emotional feature components are normalized toward their corresponding neutral feature components. The proposed system is evaluated in neutrally trained, emotionally tested conditions for different speech features and emotional states (Anger, Disgust, Fear, Happy, and Sad). The system is built on the Kaldi automatic speech recognition toolkit, with the Persian emotional speech database and the crowd-sourced emotional multi-modal actors dataset as input corpora. The experiments show that, in general, warped emotional features yield better recognition performance than their unwarped counterparts. In addition, the deep neural network-hidden Markov model system outperforms the hybrid with the Gaussian mixture model.
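The abstract does not give the warping function itself, so the following is only a minimal sketch of how frequency warping in the filterbank domain might look. It assumes a piecewise-linear warp, a common choice in vocal tract length normalization but not necessarily the one used in the paper. The function names, the warp factor `alpha`, the break-point ratio of 0.85, and the 23-filter, 8 kHz setup are all hypothetical and chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Convert a mel-scale value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def warp_frequency(f_hz, alpha, f_max, cut_ratio=0.85):
    """Piecewise-linear warp (assumed form, not taken from the paper).

    Below the break point the frequency is scaled by alpha; above it a
    second linear segment is used so that f_max still maps to f_max.
    """
    f_cut = cut_ratio * f_max / max(alpha, 1.0)
    if f_hz <= f_cut:
        return alpha * f_hz
    # Line through (f_cut, alpha * f_cut) and (f_max, f_max).
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f_hz - f_cut)


def warped_mel_centers(n_filters, f_max, alpha):
    """Center frequencies (Hz) of a mel filterbank after warping."""
    mel_max = hz_to_mel(f_max)
    centers = []
    for i in range(1, n_filters + 1):
        center_hz = mel_to_hz(mel_max * i / (n_filters + 1))
        centers.append(warp_frequency(center_hz, alpha, f_max))
    return centers


if __name__ == "__main__":
    # Hypothetical setup: 23 filters up to 8 kHz, warp factor 1.1.
    for original, warped in zip(warped_mel_centers(23, 8000.0, 1.0),
                                warped_mel_centers(23, 8000.0, 1.1)):
        print(f"{original:8.1f} Hz -> {warped:8.1f} Hz")
```

In a setup like this, the warp factor would typically be chosen per emotion or per utterance so that the warped emotional features match a neutral acoustic model as closely as possible; how the paper selects it is not stated in the abstract.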
Related papers
  • [1] Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition
    Geravanchizadeh, Masoud
    Forouhandeh, Elnaz
    Bashirpour, Meysam
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)
  • [2] A novel feature transformation for vocal tract length normalization in automatic speech recognition
    Claes, T
    Dologlou, I
    ten Bosch, L
    Van Compernolle, D
    [J]. IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1998, 6 (06): 549-557
  • [3] Frequency warping approach for vocal tract length normalization in speech recognition
    Xu, W
    Wang, BX
    Ding, Q
    [J]. PROCEEDINGS OF THE THIRD INTERNATIONAL SYMPOSIUM ON INSTRUMENTATION SCIENCE AND TECHNOLOGY, VOL 2, 2004: 494-499
  • [4] Enhancing Vocal Tract Length Normalization with Elastic Registration for Automatic Speech Recognition
    Mueller, Florian
    Mertins, Alfred
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012: 1362-1365
  • [5] Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis
    Saheer, Lakshmi
    Dines, John
    Garner, Philip N.
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (07): 2134-2148
  • [6] Vocal tract length normalization using rapid maximum-likelihood estimation for speech recognition
    Emori, Tadashi
    Shinoda, Koichi
    [J]. Systems and Computers in Japan, 2002, 33 (05): 30-40
  • [7] Prosodic feature normalization for emotion recognition by using synthesized speech
    Suzuki, Motoyuki
    Nakagawa, Shohei
    Kita, Kenji
    [J]. ADVANCES IN KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, 2012, 243: 306-313
  • [8] Vocal Tract Length Normalization for Vowel Recognition in Low Resource Languages
    Sharma, Shubham
    Madhavi, Maulik C.
    Patil, Hemant A.
    [J]. PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2014), 2014: 54-57
  • [9] IG-based feature extraction and compensation for emotion recognition from speech
    Chuang, ZJ
    Wu, CH
    [J]. AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, PROCEEDINGS, 2005, 3784: 358-365
  • [10] A Study on the Influence of Covariance Adaptation on Jacobian Compensation in Vocal Tract Length Normalization
    Sanand, D. R.
    Rath, S. P.
    Umesh, S.
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009: 544-547