Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition

Authors
Md Shah Fahad
Ashish Ranjan
Akshay Deepak
Gayadhar Pradhan
Affiliations
[1] National Institute of Technology Patna, Department of Computer Science and Engineering
[2] Vellore Institute of Technology, School of Computing Science and Engineering
[3] Siksha ‘O’ Anusandhan (Deemed to be University), Department of Computer Science and Engineering
[4] National Institute of Technology Patna, Department of Electronics and Communication
Keywords
Speech emotion recognition; Gradient reversal layer (GRL); Domain adversarial neural network (DANN); Speaker adversarial neural network (SANN); Speaker-independent; Speaker-invariant
DOI: not available
Abstract
Recently, domain adversarial neural networks (DANN) have delivered promising results on out-of-domain data. This paper exploits DANN for speaker-independent emotion recognition, where the domain corresponds to speakers, i.e., the training and testing datasets contain different speakers. The result is a speaker adversarial neural network (SANN). The proposed SANN extracts speaker-invariant and emotion-specific discriminative features for speech emotion recognition. To extract speaker-invariant features, multi-task adversarial training of a deep neural network (DNN) is employed. The DNN framework consists of two sub-networks: one for emotion classification (the primary task) and the other for speaker classification (the auxiliary task). A gradient reversal layer (GRL) is inserted between (a) the layer shared by the two classifiers and (b) the speaker classifier. The objective of the GRL is to reduce the variance among speakers by maximizing the speaker classification loss. The framework jointly optimizes the two sub-networks, minimizing the emotion classification loss while maximizing the speaker classification loss in a min-max fashion. The proposed network was evaluated on the IEMOCAP and EMODB datasets. A total of 1582 features were extracted with the standard openSMILE toolkit, and a subset of these features was then selected using a genetic algorithm. On the IEMOCAP dataset, the proposed SANN model achieved relative improvements of +6.025% (weighted accuracy) and +5.62% (unweighted accuracy) over the baseline system. Similar results were observed for the EMODB dataset. Further, despite differences in models and features, significant improvements in accuracy were also obtained over state-of-the-art methods.
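To make the adversarial min-max training concrete, the sketch below implements the core SANN idea in PyTorch (a reconstruction from the abstract, not the authors' code): a gradient reversal layer written as a custom autograd function, a shared feature extractor feeding an emotion head and a speaker head, and a joint training step. The hidden-layer sizes, the numbers of emotion and speaker classes, and the loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back toward the shared layers.
        return -ctx.lambd * grad_output, None

class SANN(nn.Module):
    # n_features=1582 matches the openSMILE feature count in the abstract;
    # 4 emotions / 10 speakers are illustrative (typical IEMOCAP settings).
    def __init__(self, n_features=1582, n_emotions=4, n_speakers=10, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        # Shared feature extractor: the layer common to both classifiers.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(256, n_emotions)   # primary task
        self.speaker_head = nn.Linear(256, n_speakers)   # auxiliary task, behind the GRL

    def forward(self, x):
        h = self.shared(x)
        emotion_logits = self.emotion_head(h)
        # The GRL sits between the shared layer and the speaker classifier.
        speaker_logits = self.speaker_head(GradientReversal.apply(h, self.lambd))
        return emotion_logits, speaker_logits

model = SANN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x, emotion_y, speaker_y, alpha=1.0):
    """One joint update: minimize emotion loss, maximize speaker loss via the GRL."""
    optimizer.zero_grad()
    emotion_logits, speaker_logits = model(x)
    loss = criterion(emotion_logits, emotion_y) + alpha * criterion(speaker_logits, speaker_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the GRL negates the gradient flowing from the speaker head into the shared layers, plain gradient descent simultaneously descends on the emotion loss and ascends on the speaker loss with respect to the shared representation, which is exactly the min-max objective described above.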
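The abstract also mentions selecting a subset of the 1582 openSMILE features with a genetic algorithm. Below is a minimal binary-mask GA sketch; the population size, generation count, mutation rate, and the scikit-learn LinearSVC fitness proxy are all assumptions for illustration, and the paper's actual operators and fitness function may differ.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def fitness(mask, X_tr, y_tr, X_val, y_val):
    # Fitness = validation accuracy of a lightweight classifier trained on
    # the selected feature subset (LinearSVC is a stand-in, not the paper's DNN).
    cols = mask.astype(bool)
    if not cols.any():
        return 0.0
    clf = LinearSVC(max_iter=5000).fit(X_tr[:, cols], y_tr)
    return clf.score(X_val[:, cols], y_val)

def ga_select(X_tr, y_tr, X_val, y_val, n_features=1582,
              pop_size=30, generations=20, mutation_p=0.01):
    # Each chromosome is a 0/1 mask over the feature vector.
    pop = rng.integers(0, 2, size=(pop_size, n_features), dtype=np.int8)
    for _ in range(generations):
        scores = np.array([fitness(ind, X_tr, y_tr, X_val, y_val) for ind in pop])
        # Binary tournament selection of parents.
        parents = pop[[max(rng.choice(pop_size, size=2, replace=False),
                           key=lambda i: scores[i])
                       for _ in range(pop_size)]]
        # Single-point crossover between consecutive parent pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_features)
            children[i, cut:], children[i + 1, cut:] = (
                parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # Bit-flip mutation.
        flip = rng.random(children.shape) < mutation_p
        children[flip] ^= 1
        # Elitism: carry over the best individual from the previous generation.
        children[0] = pop[scores.argmax()]
        pop = children
    scores = np.array([fitness(ind, X_tr, y_tr, X_val, y_val) for ind in pop])
    return pop[scores.argmax()]  # best binary mask over the features
```

The returned mask would then be applied to both training and test features before feeding them to the SANN, so that the network only ever sees the selected subset.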
Pages: 6113–6135 (22 pages)