END-TO-END MULTI-SPEAKER SPEECH RECOGNITION

被引:0
|
作者
Settle, Shane [2 ]
Le Roux, Jonathan [1 ]
Hori, Takaaki [1 ]
Watanabe, Shinji [1 ]
Hershey, John R. [1 ]
机构
[1] MERL, Cambridge, MA 02139 USA
[2] TTI Chicago, Chicago, IL 60637 USA
关键词
deep clustering; speaker-independent multi-talker speech separation; end-to-end asr; cocktail party problem;
D O I
暂无
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Current advances in deep learning have resulted in a convergence of methods across a wide range of tasks, opening the door for tighter integration of modules that were previously developed and optimized in isolation. Recent ground-breaking works have produced end-to-end deep network methods for both speech separation and end-to-end automatic speech recognition (ASR). Speech separation methods such as deep clustering address the challenging cocktail-party problem of distinguishing multiple simultaneous speech signals. This is an enabling technology for real-world human machine interaction (HMI). However, speech separation requires ASR to interpret the speech for any HMI task. Likewise, ASR requires speech separation to work in an unconstrained environment. Although these two components can be trained in isolation and connected after the fact, this paradigm is likely to be sub-optimal, since it relies on artificially mixed data. In this paper, we develop the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The joint training framework synergistically adapts the separation and recognition to each other. As an additional benefit, it enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.
引用
收藏
页码:4819 / 4823
页数:5
相关论文
共 50 条
  • [1] End-to-End Multilingual Multi-Speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    [J]. INTERSPEECH 2019, 2019, : 3755 - 3759
  • [2] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [3] A Purely End-to-end System for Multi-speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    [J]. PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 2620 - 2630
  • [4] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
    Denisov, Pavel
    Ngoc Thang Vu
    [J]. INTERSPEECH 2019, 2019, : 4425 - 4429
  • [5] MIMO-SPEECH: END-TO-END MULTI-CHANNEL MULTI-SPEAKER SPEECH RECOGNITION
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 237 - 244
  • [6] Real-time End-to-End Monaural Multi-speaker Speech Recognition
    Li, Song
    Ouyang, Beibei
    Tong, Fuchuan
    Liao, Dexin
    Li, Lin
    Hong, Qingyang
    [J]. INTERSPEECH 2021, 2021, : 3750 - 3754
  • [7] Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
    Cooper, Erica
    Lai, Cheng-, I
    Yasuda, Yusuke
    Yamagishi, Junichi
    [J]. INTERSPEECH 2020, 2020, : 3979 - 3983
  • [8] END-TO-END MULTI-SPEAKER ASR WITH INDEPENDENT VECTOR ANALYSIS
    Scheibler, Robin
    Zhang, Wangyou
    Chang, Xuankai
    Watanabe, Shinji
    Qian, Yanmin
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 496 - 501
  • [9] SPEAKER ADAPTATION FOR MULTICHANNEL END-TO-END SPEECH RECOGNITION
    Ochiai, Tsubasa
    Watanabe, Shinji
    Katagiri, Shigeru
    Hori, Takaaki
    Hershey, John
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6707 - 6711
  • [10] EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR
    Chang, Xuankai
    Moritz, Niko
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7322 - 7326