End-to-end Music-mixed Speech Recognition

Cited: 0
Authors
Woo, Jeongwoo [1 ]
Mimura, Masato [1 ]
Yoshii, Kazuyoshi [1 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Kyoto, Japan
Keywords
SEPARATION;
DOI
Not available
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Automatic speech recognition (ASR) of multimedia content is a promising application, but the speech in such content is frequently mixed with background music, which degrades ASR performance. In this study, we propose a method for improving ASR with background music based on time-domain source separation. We use Conv-TasNet, which has achieved state-of-the-art performance in multi-speaker source separation, as the separation network to extract the speech signal from a speech-music mixture in the waveform domain. We also propose jointly fine-tuning the pre-trained Conv-TasNet front-end with an attention-based ASR back-end using both separation and ASR objectives. We evaluated our method through ASR experiments on speech data mixed with background music from a wide variety of Japanese animations. We show that time-domain speech-music separation drastically improves the ASR performance of a back-end model trained on mixture data, and that joint optimization yields a further significant reduction in word error rate (WER). The time-domain separation method outperformed a frequency-domain separation method, which reuses the phase information of the input mixture signal, in both simple cascading and joint training settings. We also demonstrate that our method works robustly against music interference from the classical, jazz, and popular genres.
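The abstract describes a cascade in which gradients from the ASR objective flow back through the waveform-domain separator. A minimal PyTorch-style sketch of that joint fine-tuning loss follows; the injected separator and asr modules, the SI-SNR choice for the separation objective, and the weight sep_weight are illustrative assumptions, not the authors' exact recipe.

    # Hypothetical sketch of joint front-end/back-end fine-tuning.
    # "separator" stands in for a pre-trained Conv-TasNet and "asr" for a
    # pre-trained attention encoder-decoder; details may differ from the paper.
    import torch
    import torch.nn as nn

    def si_snr_loss(est: torch.Tensor, ref: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
        """Negative scale-invariant SNR, a common time-domain separation loss."""
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference to isolate the target part.
        target = (est * ref).sum(-1, keepdim=True) * ref \
                 / (ref.pow(2).sum(-1, keepdim=True) + eps)
        noise = est - target
        si_snr = 10 * torch.log10(target.pow(2).sum(-1)
                                  / (noise.pow(2).sum(-1) + eps) + eps)
        return -si_snr.mean()

    class JointSeparationASR(nn.Module):
        """Cascade of a waveform separator and an attention-based ASR model,
        fine-tuned end-to-end with a weighted sum of both objectives."""

        def __init__(self, separator: nn.Module, asr: nn.Module,
                     sep_weight: float = 0.1):
            super().__init__()
            self.separator = separator      # e.g. pre-trained Conv-TasNet
            self.asr = asr                  # e.g. attention encoder-decoder
            self.sep_weight = sep_weight    # loss balance (assumed value)

        def forward(self, mixture, clean_speech, token_targets):
            est_speech = self.separator(mixture)  # waveform speech estimate
            sep_loss = si_snr_loss(est_speech, clean_speech)
            # Assumes the ASR module returns its training loss (e.g. attention
            # cross-entropy) given input audio and reference token targets.
            asr_loss = self.asr(est_speech, token_targets)
            return asr_loss + self.sep_weight * sep_loss

In a sketch like this, the separation term keeps the front-end output close to clean speech while the ASR term adapts it to what the recognizer actually needs; how the two terms are weighted, and whether the separator is frozen initially, would be tuning decisions.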
Pages: 800-804
Number of pages: 5