INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 30
Authors
Raj, Desh [1 ]
Denisov, Pavel [2 ]
Chen, Zhuo [3 ]
Erdogan, Hakan [4 ]
Huang, Zili [1 ]
He, Maokui [5 ,6 ]
Watanabe, Shinji [1 ]
Du, Jun [5 ,6 ]
Yoshioka, Takuya [3 ]
Luo, Yi
Kanda, Naoyuki [3 ]
Li, Jinyu [3 ]
Wisdom, Scott [4 ]
Hershey, John R. [4 ]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] Google Res, Cambridge, MA USA
[5] Univ Sci & Technol China, Hefei, Peoples R China
[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
Keywords
Speech separation; diarization; speech recognition; multi-speaker
DOI
10.1109/SLT48900.2021.9383556
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
Pages: 897-904
Page count: 8
Related papers
50 in total
  • [21] MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation
    Li, Xiyun
    Xu, Yong
    Yu, Meng
    Zhang, Shi-Xiong
    Xu, Jiaming
    Xu, Bo
    Yu, Dong
    INTERSPEECH 2021, 2021, : 1119 - 1123
  • [22] MULTI-SPEAKER AND CONTEXT-INDEPENDENT ACOUSTICAL CUES FOR AUTOMATIC SPEECH RECOGNITION
    ROSSI, M
    NISHINUMA, Y
    MERCIER, G
    SPEECH COMMUNICATION, 1983, 2 (2-3) : 215 - 217
  • [23] Silent versus modal multi-speaker speech recognition from ultrasound and video
    Ribeiro, Manuel Sam
    Eshky, Aciel
    Richmond, Korin
    Renals, Steve
    INTERSPEECH 2021, 2021, : 641 - 645
  • [24] Analysis of Oral Exams With Speaker Diarization and Speech Emotion Recognition: A Case Study
    Beccaro, Wesley
    Ramirez, Miguel Arjona
    Liaw, William
    Guimaraes, Heitor Rodrigues
    IEEE TRANSACTIONS ON EDUCATION, 2024, 67 (01) : 74 - 86
  • [25] SuperFormer: Enhanced Multi-Speaker Speech Separation Network Combining Channel and Spatial Adaptability
    Jiang, Yanji
    Qiu, Youli
    Shen, Xueli
    Sun, Chuan
    Liu, Haitao
    APPLIED SCIENCES-BASEL, 2022, 12 (15):
  • [26] Multi-speaker Speech Separation under Reverberation Conditions Using Conv-Tasnet
    Wang, Chunxi
    Jia, Maoshen
    Zhang, Yanyan
    Li, Lu
    JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2023, 14 (04) : 694 - 700
  • [27] SPEAKER DIARIZATION AND SPEECH RECOGNITION IN THE SEMI-AUTOMATIZATION OF AUDIO DESCRIPTION: AN EXPLORATORY STUDY ON FUTURE POSSIBILITIES?
    Delgado, Hector
    Matamala, Anna
    Serrano, Javier
    CADERNOS DE TRADUCAO, 2015, 35 (02): : 308 - 324
  • [28] Real-time End-to-End Monaural Multi-speaker Speech Recognition
    Li, Song
    Ouyang, Beibei
    Tong, Fuchuan
    Liao, Dexin
    Li, Lin
    Hong, Qingyang
    INTERSPEECH 2021, 2021, : 3750 - 3754
  • [29] Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding
    Kim, Minsoo
    Jang, Gil-Jin
    Applied Sciences (Switzerland), 2024, 14 (18):
  • [30] Single Channel multi-speaker speech Separation based on quantized ratio mask and residual network
    Shanfa Ke
    Ruimin Hu
    Xiaochen Wang
    Tingzhao Wu
    Gang Li
    Zhongyuan Wang
    Multimedia Tools and Applications, 2020, 79 : 32225 - 32241