END-TO-END MONAURAL MULTI-SPEAKER ASR SYSTEM WITHOUT PRETRAINING

Authors
Chang, Xuankai [1, 2]
Qian, Yanmin [1]
Yu, Kai [1]
Watanabe, Shinji [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai, Peoples R China
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
Cocktail party problem; multi-speaker speech recognition; end-to-end speech recognition; CTC; attention mechanism; SEPARATION;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Recently, end-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR). Multi-speaker speech separation and recognition is a central task in the cocktail party problem. In this paper, we present a state-of-the-art monaural multi-speaker end-to-end automatic speech recognition model. In contrast to previous studies on monaural multi-speaker speech recognition, this end-to-end framework is trained to recognize multiple label sequences completely from scratch. The system requires only the speech mixture and the corresponding label sequences, without needing any indeterminate supervision obtained from non-mixture speech or corresponding labels/alignments. Moreover, we explore using an individual attention module for each separated speaker, together with scheduled sampling, to further improve performance. Finally, we evaluate the proposed model on 2-speaker mixed speech generated from the WSJ corpus and on the wsj0-2mix dataset, which is a speech separation and recognition benchmark. The experiments demonstrate that the proposed methods improve the performance of the end-to-end model in separating overlapping speech and recognizing the separated streams. The proposed model yields relative performance gains of approximately 10.0% in terms of CER and WER, respectively.
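The abstract reports its result as a relative gain in CER and WER. As a reading aid, the minimal Python sketch below shows how these quantities are conventionally computed: an error rate is the edit distance between reference and hypothesis divided by the reference length, and a relative gain is the reduction in error rate divided by the baseline error rate. The function names and the example strings are hypothetical, chosen for illustration only. They are not taken from the paper and do not reproduce its scoring pipeline or its numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def relative_gain(baseline, proposed):
    """Relative error reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


# Illustrative values only (assumed, not results from the paper).
example_wer = wer("the cat sat", "the cat sit")
example_gain = relative_gain(20.0, 18.0)
```

With these definitions, a baseline error rate of 20.0% reduced to 18.0% corresponds to a relative gain of 10%.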
Pages: 6256 - 6260
Page count: 5