End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

Cited by: 12
Authors
Denisov, Pavel [1]
Vu, Ngoc Thang [1]
Affiliations
[1] Univ Stuttgart, Inst Nat Language Proc IMS, Stuttgart, Germany
Source
Keywords
end-to-end ASR; overlapped speech
DOI
10.21437/Interspeech.2019-1130
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. The proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers. Our experimental results on overlapped speech datasets show that jointly applying speaker-embedding conditioning and transfer learning significantly improves ASR performance.
Pages: 4425-4429
Page count: 5