On multi-domain training and adaptation of end-to-end RNN acoustic models for distant speech recognition

Cited by: 13
Authors
Mirsamadi, Seyedmandad [1 ]
Hansen, John H. L. [1 ]
Affiliations
[1] Univ Texas Dallas, CRSS, Richardson, TX 75080 USA
Keywords
distant speech recognition; recurrent neural network; multi-domain training;
DOI
10.21437/Interspeech.2017-398
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recognition of distant (far-field) speech is a challenge for ASR due to the mismatch in recording conditions caused by room reverberation and environmental noise. Given the remarkable learning capacity of deep neural networks, there is increasing interest in addressing this problem by using a large corpus of reverberant far-field speech to train robust models. In this study, we explore how an end-to-end RNN acoustic model trained on speech from different rooms and acoustic conditions (different domains) achieves robustness to environmental variations. It is shown that the first hidden layer acts as a domain separator, projecting the data from different domains into different sub-spaces. The subsequent layers then use this encoded domain knowledge to map these features to final representations that are invariant to domain change. This mechanism is closely related to noise-aware or room-aware approaches, which append manually extracted domain signatures to the input features. Additionally, we demonstrate how this understanding of the learning procedure provides useful guidance for model adaptation to new acoustic conditions. We present results on the AMI corpus to demonstrate the propagation of domain information in a deep RNN, and perform recognition experiments that indicate the role of encoded domain knowledge in the training and adaptation of RNN acoustic models.
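As a rough illustration of the analysis summarized in the abstract, the Python/PyTorch sketch below builds a generic stacked-LSTM acoustic model and compares per-layer activation centroids for utterances from two acoustic domains, the kind of probing one could use to observe the first layer acting as a domain separator. This is not the authors' implementation: the layer sizes, target count, class and variable names, and the synthetic input tensors are all assumptions made only to show the idea; the last two lines likewise sketch the room-aware style of appending a domain signature to the input features.

```python
# Minimal sketch (assumed architecture, synthetic data) of layer-wise domain probing.
import torch
import torch.nn as nn


class StackedLSTMAcousticModel(nn.Module):
    """Generic multi-layer LSTM acoustic model with accessible intermediate states."""

    def __init__(self, feat_dim=40, hidden_dim=512, num_layers=4, num_targets=4000):
        super().__init__()
        # Stack the recurrent layers individually so each layer's output can be inspected.
        self.layers = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
             for i in range(num_layers)]
        )
        self.output = nn.Linear(hidden_dim, num_targets)

    def forward(self, feats):
        hidden_states = []               # keep every layer's activations for probing
        x = feats
        for lstm in self.layers:
            x, _ = lstm(x)
            hidden_states.append(x)
        return self.output(x), hidden_states


model = StackedLSTMAcousticModel()
close_talk = torch.randn(8, 200, 40)     # [batch, frames, features]; synthetic stand-ins
far_field = torch.randn(8, 200, 40)      # for real data these would be filterbank features

with torch.no_grad():
    _, h_close = model(close_talk)
    _, h_far = model(far_field)

# If the first layer acts as a domain separator, its domain centroids should lie far
# apart, while deeper layers should yield more domain-invariant activations.
for idx, (hc, hf) in enumerate(zip(h_close, h_far)):
    gap = (hc.mean(dim=(0, 1)) - hf.mean(dim=(0, 1))).norm().item()
    print(f"layer {idx}: centroid distance = {gap:.3f}")

# Room-aware variant mentioned in the abstract (sketch): append a per-utterance
# domain signature (here a made-up 10-dim embedding) to every input frame.
signature = torch.randn(8, 1, 10).expand(-1, 200, -1)
room_aware_feats = torch.cat([far_field, signature], dim=-1)   # feat_dim would become 50
```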
Pages: 404 - 408
Number of pages: 5
Related papers
50 records in total
  • [31] Multi-Stream End-to-End Speech Recognition
    Li, Ruizhi
    Wang, Xiaofei
    Mallidi, Sri Harish
    Watanabe, Shinji
    Hori, Takaaki
    Hermansky, Hynek
    IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, 2020, 28: 646 - 655
  • [32] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION
    Settle, Shane
    Le Roux, Jonathan
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4819 - 4823
  • [33] Provision of end-to-end QoS in heterogeneous multi-domain networks
    Wojciech Burakowski
    Andrzej Bęben
    Halina Tarasiuk
    Jarosław Śliwiński
    Robert Janowski
    Jordi Mongay Batalla
    Piotr Krawiec
    Annals of Telecommunications - Annales des Télécommunications, 2008, 63: 559 - 577
  • [34] The self-adaptation of acoustic encoder in end-to-end automatic speech recognition under diverse acoustic scenes
    Liu Y.
    Zheng L.
    Li T.
    Zhang P.
    Shengxue Xuebao/Acta Acustica, 2023, 48 (06): 1260 - 1268
  • [35] Text Only Domain Adaptation with Phoneme Guided Data Splicing for End-to-End Speech Recognition
    Wang, Wei
    Gong, Xun
    Shao, Hang
    Yang, Dongning
    Qian, Yanmin
    INTERSPEECH 2023, 2023, : 3347 - 3351
  • [36] Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation
    Wongpatikaseree, Konlakorn
    Singkul, Sattaya
    Hnoohom, Narit
    Yuenyong, Sumeth
    BIG DATA AND COGNITIVE COMPUTING, 2022, 6 (03)
  • [37] Framewise Supervised Training towards End-to-End Speech Recognition Models: First Results
    Li, Mohan
    Cao, Yuanjiang
    Zhou, Weicong
    Liu, Min
    INTERSPEECH 2019, 2019, : 1641 - 1645
  • [38] End-to-End Speech Recognition Sequence Training With Reinforcement Learning
    Tjandra, Andros
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE ACCESS, 2019, 7 : 79758 - 79769
  • [39] Improved training for online end-to-end speech recognition systems
    Kim, Suyoun
    Seltzer, Michael L.
    Li, Jinyu
    Zhao, Rui
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2913 - 2917
  • [40] SEQUENCE NOISE INJECTED TRAINING FOR END-TO-END SPEECH RECOGNITION
    Saon, George
    Tuske, Zoltan
    Audhkhasi, Kartik
    Kingsbury, Brian
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6261 - 6265