END-TO-END MULTI-SPEAKER SPEECH RECOGNITION

Cited by: 0
Authors
Settle, Shane [2]
Le Roux, Jonathan [1]
Hori, Takaaki [1]
Watanabe, Shinji [1]
Hershey, John R. [1]
Affiliations
[1] MERL, Cambridge, MA 02139 USA
[2] TTI Chicago, Chicago, IL 60637 USA
Keywords
deep clustering; speaker-independent multi-talker speech separation; end-to-end ASR; cocktail party problem
DOI
Not available
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Current advances in deep learning have resulted in a convergence of methods across a wide range of tasks, opening the door for tighter integration of modules that were previously developed and optimized in isolation. Recent ground-breaking works have produced end-to-end deep network methods for both speech separation and automatic speech recognition (ASR). Speech separation methods such as deep clustering address the challenging cocktail-party problem of distinguishing multiple simultaneous speech signals. This is an enabling technology for real-world human-machine interaction (HMI). However, speech separation requires ASR to interpret the speech for any HMI task. Likewise, ASR requires speech separation to work in an unconstrained environment. Although these two components can be trained in isolation and connected after the fact, this paradigm is likely to be sub-optimal, since it relies on artificially mixed data. In this paper, we develop the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The joint training framework synergistically adapts the separation and recognition to each other. As an additional benefit, it enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large-scale training on existing transcribed data.
Pages: 4819-4823
Number of pages: 5