REDAT: ACCENT-INVARIANT REPRESENTATION FOR END-TO-END ASR BY DOMAIN ADVERSARIAL TRAINING WITH RELABELING

Times Cited: 8
|
Authors
Hu, Hu [1 ,3 ]
Yang, Xuesong [2 ]
Raeesy, Zeynab [2 ]
Guo, Jinxi [2 ]
Keskin, Gokce [2 ]
Arsikere, Harish [2 ]
Rastrow, Ariya [2 ]
Stolcke, Andreas [2 ]
Maas, Roland [2 ]
Alexa, Amazon [2 ]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Amazon Alexa, Seattle, WA USA
[3] Amazon, Seattle, WA USA
Keywords
Accent-invariance; end-to-end ASR; domain adversarial training; multi-accent ASR; RNN transducer;
DOI
10.1109/ICASSP39728.2021.9414291
CLC Number
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
Accent mismatch is a critical problem for end-to-end ASR. This paper addresses it by building an accent-robust RNN-T system with domain adversarial training (DAT). We unveil the magic behind DAT and provide, for the first time, a theoretical guarantee that DAT learns accent-invariant representations. We also prove that performing the gradient reversal in DAT is equivalent to minimizing the Jensen-Shannon divergence between domain output distributions. Motivated by this equivalence, we introduce reDAT, a novel technique based on DAT that relabels data using either unsupervised clustering or soft labels. Experiments on 23K hours of multi-accent data show that DAT achieves results competitive with accent-specific baselines on both native and non-native English accents, with up to 13% relative WER reduction on unseen accents; reDAT yields further relative improvements over DAT of 3% and 8% on non-native accents of American and British English, respectively.
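The abstract's central theoretical claim is that gradient reversal in DAT minimizes the Jensen-Shannon divergence between the accent-domain output distributions; when that divergence reaches zero, the encoder's representation carries no accent information. A minimal pure-Python sketch of the quantity being minimized (function names are illustrative, not code from the paper):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2).

    In DAT, the domain discriminator implicitly estimates this quantity
    between accent-domain distributions; the gradient reversal layer pushes
    the encoder to minimize it, driving the representations toward
    accent-invariance (JSD = 0 means the domains are indistinguishable).
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution M
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Identical domain distributions -> divergence 0 (perfect invariance)
print(js_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
# Disjoint distributions -> the maximum, log(2) ~ 0.6931
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Unlike plain KL divergence, JSD is symmetric and finite even for non-overlapping supports, which is why it appears in adversarial-training analyses.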
Pages: 6408-6412
Page count: 5
Related Papers
48 in total
  • [1] AIPNET: GENERATIVE ADVERSARIAL PRE-TRAINING OF ACCENT-INVARIANT NETWORKS FOR END-TO-END SPEECH RECOGNITION
    Chen, Yi-Chen
    Yang, Zhaojun
    Yeh, Ching-Feng
    Jain, Mahaveer
    Seltzer, Michael L.
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6979 - 6983
  • [2] SPEAKER AND LANGUAGE AWARE TRAINING FOR END-TO-END ASR
    Bansal, Shubham
    Malhotra, Karan
    Ganapathy, Sriram
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 494 - 501
  • [3] End-to-End Speech Translation with Adversarial Training
    Li, Xuancai
    Chen, Kehai
    Zhao, Tiejun
    Yang, Muyun
    [J]. WORKSHOP ON AUTOMATIC SIMULTANEOUS TRANSLATION CHALLENGES, RECENT ADVANCES, AND FUTURE DIRECTIONS, 2020, : 10 - 14
  • [4] Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models
    Lu, Zhiyun
    Han, Wei
    Zhang, Yu
    Cao, Liangliang
    [J]. INTERSPEECH 2021, 2021, : 3460 - 3464
  • [5] Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks
    Na, Hyeong-Ju
    Park, Jeong-Sik
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (18):
  • [6] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [7] Semi-supervised ASR by End-to-end Self-training
    Chen, Yang
    Wang, Weiran
    Wang, Chao
    [J]. INTERSPEECH 2020, 2020, : 2787 - 2791
  • [8] End-to-end Domain-Adversarial Voice Activity Detection
    Lavechin, Marvin
    Gill, Marie-Philippe
    Bousbib, Ruben
    Bredin, Herve
    Garcia-Perera, Leibny Paola
    [J]. INTERSPEECH 2020, 2020, : 3685 - 3689
  • [9] End-to-end Knowledge Triplet Extraction Combined with Adversarial Training
    Huang P.
    Zhao X.
    Fang Y.
    Zhu H.
    Xiao W.
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2019, 56 (12): : 2536 - 2548
  • [10] MINIMUM BAYES RISK TRAINING FOR END-TO-END SPEAKER-ATTRIBUTED ASR
    Kanda, Naoyuki
    Meng, Zhong
    Lu, Liang
    Gaur, Yashesh
    Wang, Xiaofei
    Chen, Zhuo
    Yoshioka, Takuya
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6503 - 6507