SPEAKER-INVARIANT TRAINING VIA ADVERSARIAL LEARNING

Cited: 0
Authors
Meng, Zhong [1 ,2 ]
Li, Jinyu [1 ]
Chen, Zhuo [1 ]
Zhao, Yong [1 ]
Mazalov, Vadim [1 ]
Gong, Yifan [1 ]
Juang, Biing-Hwang [2 ]
Affiliations
[1] Microsoft AI & Res, Redmond, WA 98052 USA
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Keywords
speaker-invariant training; adversarial learning; speech recognition; deep neural networks; adaptation
DOI
Not available
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We propose a novel adversarial multi-task learning scheme that actively curtails inter-talker feature variability while maximizing senone discriminability, so as to enhance the performance of a deep neural network (DNN) based ASR system. We call the scheme speaker-invariant training (SIT). In SIT, a DNN acoustic model and a speaker classifier network are jointly optimized to minimize the senone (tied triphone state) classification loss and simultaneously mini-maximize the speaker classification loss. Through this adversarial multi-task learning, a deep feature is learned that is both speaker-invariant and senone-discriminative. With SIT, a canonical DNN acoustic model with significantly reduced variance in its output probabilities is learned, with no explicit speaker-independent (SI) transformations or speaker-specific representations used in training or testing. Evaluated on the CHiME-3 dataset, SIT achieves a 4.99% relative word error rate (WER) improvement over the conventional SI acoustic model. With additional unsupervised speaker adaptation, the speaker-adapted (SA) SIT model achieves a 4.86% relative WER gain over the SA SI acoustic model.
Pages: 5969 - 5973
Page count: 5
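The abstract's mini-max objective is commonly realized with a gradient reversal layer: a shared feature extractor feeds a senone classifier directly and a speaker classifier through a layer that flips gradients on the backward pass, so the shared features are trained to confuse the speaker classifier while staying senone-discriminative. Below is a minimal, hypothetical PyTorch sketch of that idea; the layer sizes, class counts, reversal weight `lambda_`, and all names are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of speaker-invariant training (SIT) via a gradient
# reversal layer. All dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips and scales gradients on the
    backward pass, so the encoder learns to confuse the speaker head."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None


class SITModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_senones=3000, n_speakers=100):
        super().__init__()
        # Shared feature extractor (lower DNN layers).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Senone (tied triphone state) classifier head.
        self.senone_head = nn.Linear(hidden, n_senones)
        # Adversarial speaker classifier head.
        self.speaker_head = nn.Linear(hidden, n_speakers)

    def forward(self, x, lambda_=0.5):
        f = self.encoder(x)
        senone_logits = self.senone_head(f)
        # The speaker head minimizes its loss, while the reversed gradient
        # drives the encoder to maximize it: the "mini-max" in the abstract.
        speaker_logits = self.speaker_head(GradReverse.apply(f, lambda_))
        return senone_logits, speaker_logits


# One illustrative training step on random data.
model = SITModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(32, 40)                     # batch of acoustic frames
senone_tgt = torch.randint(0, 3000, (32,))  # senone labels
speaker_tgt = torch.randint(0, 100, (32,))  # speaker labels

senone_logits, speaker_logits = model(x)
loss = F.cross_entropy(senone_logits, senone_tgt) \
     + F.cross_entropy(speaker_logits, speaker_tgt)
opt.zero_grad()
loss.backward()  # reversal layer flips the speaker gradient into the encoder
opt.step()
```

At test time only the encoder and senone head are used, which matches the abstract's claim that no speaker-specific representations are needed in training or testing.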