Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition

Citations: 0
Authors
Md Shah Fahad
Ashish Ranjan
Akshay Deepak
Gayadhar Pradhan
Affiliations
[1] National Institute of Technology Patna, Department of Computer Science and Engineering
[2] Vellore Institute of Technology, School of Computing Science and Engineering
[3] Siksha ‘O’ Anusandhan (Deemed to be University), Department of Computer Science and Engineering
[4] National Institute of Technology Patna, Department of Electronics and Communication
Keywords
Speech emotion recognition; Gradient reversal layer (GRL); Domain adversarial neural network (DANN); Speaker adversarial neural network (SANN); Speaker-independent; Speaker-invariant
DOI
Not available
Abstract
Recently, domain adversarial neural networks (DANN) have delivered promising results on out-of-domain data. This paper exploits DANN for speaker-independent emotion recognition, where the domain corresponds to speakers, i.e., the training and testing sets contain disjoint speakers. The result is a speaker adversarial neural network (SANN). The proposed SANN extracts speaker-invariant yet emotion-discriminative features for speech emotion recognition. To extract speaker-invariant features, multi-task adversarial training of a deep neural network (DNN) is employed. The DNN framework consists of two sub-networks: one for emotion classification (the primary task) and one for speaker classification (the auxiliary task). A gradient reversal layer (GRL) is inserted between (a) the layer shared by the primary and auxiliary classifiers and (b) the auxiliary classifier. The GRL reduces the variance among speakers by maximizing the speaker classification loss with respect to the shared representation. The framework jointly optimizes the two sub-networks, minimizing the emotion classification loss while playing a min-max game on the speaker classification loss: the speaker classifier minimizes it, while the GRL drives the shared layers to maximize it. The network was evaluated on the IEMOCAP and EMODB datasets. A total of 1582 features were extracted with the standard openSMILE toolkit, and a subset of these features was then selected using a genetic algorithm. On IEMOCAP, the proposed SANN model achieved relative improvements of +6.025% (weighted accuracy) and +5.62% (unweighted accuracy) over the baseline system; similar results were observed on EMODB. Further, despite differences in models and features, the proposed approach also achieved significant accuracy improvements over state-of-the-art methods.
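The core mechanism described in the abstract (a shared feature extractor with an emotion head, plus a speaker head placed behind a GRL) can be sketched in a few lines of PyTorch. The sketch below is illustrative, not the authors' implementation: the layer widths, the class counts (4 emotions, 8 training speakers), the lambda value, and the optimizer settings are assumptions; only the 1582-dimensional input follows the openSMILE feature count stated in the abstract (before the genetic-algorithm selection step).

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda
    # in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # no gradient for lambd

class SANN(nn.Module):
    def __init__(self, n_features=1582, n_emotions=4, n_speakers=8, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        # Shared feature extractor (widths are illustrative assumptions).
        self.shared = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(256, n_emotions)  # primary task
        self.speaker_head = nn.Linear(256, n_speakers)  # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        # The GRL sits between the shared layer and the speaker classifier:
        # the speaker head learns to classify speakers, but the shared layers
        # receive reversed gradients and drift toward speaker invariance.
        spk = self.speaker_head(GradientReversal.apply(h, self.lambd))
        return self.emotion_head(h), spk

# One joint training step on dummy data.
model = SANN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 1582)               # batch of openSMILE feature vectors
y_emotion = torch.randint(0, 4, (32,))  # emotion labels
y_speaker = torch.randint(0, 8, (32,))  # speaker labels

emotion_logits, speaker_logits = model(x)
loss = criterion(emotion_logits, y_emotion) + criterion(speaker_logits, y_speaker)
optimizer.zero_grad()
loss.backward()
optimizer.step()

With this wiring, a single backward pass implements the min-max objective: the speaker head's parameters receive ordinary gradients and learn to identify speakers, while the shared layers receive the reversed gradient and learn features the speaker head cannot exploit.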
Pages: 6113-6135
Page count: 22
Related Papers
50 in total
  • [1] Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition
    Fahad, Md Shah
    Ranjan, Ashish
    Deepak, Akshay
    Pradhan, Gayadhar
    [J]. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2022, 41 (11) : 6113 - 6135
  • [2] On Speaker-Independent, Speaker-Dependent, and Speaker-Adaptive Speech Recognition
    Huang, Xuedong
    Lee, Kai-Fu
    [J]. IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1993, 1 (02): : 150 - 157
  • [3] A SPEAKER-INDEPENDENT SPEECH RECOGNITION SYSTEM FOR TELEPHONE NETWORK APPLICATIONS
    Trnka, R.
    [J]. REVUE TECHNIQUE THOMSON-CSF, 1984, 16 (04): : 847 - 861
  • [4] Domain Invariant Feature Learning for Speaker-Independent Speech Emotion Recognition
    Lu, Cheng
    Zong, Yuan
    Zheng, Wenming
    Li, Yang
    Tang, Chuangao
    Schuller, Bjoern W.
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 2217 - 2230
  • [5] Speaker adaptation techniques for speech recognition with a speaker-independent phonetic recognizer
    Kim, W. G.
    Jang, M.
    [J]. COMPUTATIONAL INTELLIGENCE AND SECURITY, PT 1, PROCEEDINGS, 2005, 3801 : 95 - 100
  • [6] SPEAKER-INDEPENDENT VOWEL RECOGNITION IN PERSIAN SPEECH
    Nazari, Mohammad
    Sayadiyan, Abolghasem
    Valiollahzadeh, Seyyed Majid
    [J]. 2008 3RD INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES: FROM THEORY TO APPLICATIONS, VOLS 1-5, 2008, : 672 - 676
  • [7] SPEAKER-CONSISTENT PARSING FOR SPEAKER-INDEPENDENT CONTINUOUS SPEECH RECOGNITION
    Yamaguchi, K.
    Singer, H.
    Matsunaga, S.
    Sagayama, S.
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 1995, E78D (06) : 719 - 724
  • [8] Japanese Speaker-Independent Homonyms Speech Recognition
    Murakami, Jin'ichi
    Hotta, Haseo
    [J]. COMPUTATIONAL LINGUISTICS AND RELATED FIELDS, 2011, 27 : 306 - 313
  • [9] PREDICTOR CODEBOOK FOR SPEAKER-INDEPENDENT SPEECH RECOGNITION
    Kawabata, T.
    [J]. SYSTEMS AND COMPUTERS IN JAPAN, 1994, 25 (01) : 37 - 46
  • [10] Biomimetic pattern recognition for speaker-independent speech recognition
    Qin, H.
    Wang, S. J.
    Sun, H.
    [J]. PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON NEURAL NETWORKS AND BRAIN, VOLS 1-3, 2005, : 1290 - 1294