Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria

Cited by: 3
Authors
Neumann, Thilo von [1]
Kinoshita, Keisuke [2]
Boeddeker, Christoph [1]
Delcroix, Marc [2]
Haeb-Umbach, Reinhold [1]
Affiliations
[1] Paderborn University, D-33098 Paderborn, Germany
[2] NTT Communication Science Laboratories, Kyoto 619-0237, Japan
Keywords
Continuous speech separation; source separation; Graph-PIT; dynamic programming; permutation invariant training; assignment
DOI
10.1109/TASLP.2022.3228629
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Continuous Speech Separation (CSS) has been proposed to address speech overlaps in the analysis of realistic, meeting-like conversations by removing all overlaps before further processing. CSS separates a recording of arbitrarily many speakers into a small number of overlap-free output channels, where each output channel may contain speech of multiple speakers. Often, a separation model is trained with utterance-level Permutation Invariant Training (uPIT), which maps each speaker exclusively to one output channel, and is then applied in a sliding-window fashion with subsequent stitching of the window outputs. Recently, we introduced an alternative training scheme called Graph-PIT that teaches the separator to produce the speaker-shared output channel format directly, without stitching. It can handle an arbitrary number of speakers as long as the number of simultaneously overlapping speakers never exceeds the number of output channels. Models trained in this way can perform segment-less CSS, i.e., CSS without stitching, and achieve separation quality comparable to, and often better than, conventional CSS with uPIT and stitching. In this contribution, we investigate the Graph-PIT training scheme further. We show in extended experiments that Graph-PIT also works in challenging reverberant conditions. We simplify the training schedule for Graph-PIT with the recently proposed Source-Aggregated Signal-to-Distortion Ratio (SA-SDR) loss, which eliminates unfavorable properties of the previously used A-SDR loss and thereby enables training with Graph-PIT from scratch. Furthermore, we introduce novel signal-level evaluation metrics for meeting scenarios, namely the source-aggregated scale- and convolution-invariant Signal-to-Distortion Ratios (SA-SI-SDR and SA-CI-SDR), which generalize the commonly used SDR-based metrics to the CSS case.
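For orientation, the following minimal NumPy sketch illustrates what source aggregation means for the SDR-type metrics named in the abstract: signal and error energies are summed across all output channels before a single log-ratio is taken, rather than averaging per-channel SDRs; the scale-invariant variant first projects each channel as in SI-SDR. Function names and the exact formulation are illustrative assumptions based on the abstract, not the paper's reference implementation, and the convolution-invariant variant (SA-CI-SDR), which additionally allows a short channel-wise filter, is omitted.

```python
import numpy as np

def sa_sdr(references: np.ndarray, estimates: np.ndarray) -> float:
    """Source-aggregated SDR in dB for (channels, samples) arrays.

    Energies are summed over ALL output channels before the log-ratio,
    so one ratio is computed for the whole recording instead of
    averaging per-channel SDRs (sketch per the abstract's description).
    """
    signal_energy = np.sum(references ** 2)
    error_energy = np.sum((references - estimates) ** 2)
    return float(10 * np.log10(signal_energy / error_energy))

def sa_si_sdr(references: np.ndarray, estimates: np.ndarray) -> float:
    """Scale-invariant variant: rescale each reference channel by the
    least-squares optimal factor (as in SI-SDR) before aggregating."""
    scales = (np.sum(estimates * references, axis=1, keepdims=True)
              / np.sum(references ** 2, axis=1, keepdims=True))
    return sa_sdr(scales * references, estimates)

# Toy usage: two overlap-free output channels with mild estimation noise.
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 16000))
ests = refs + 0.1 * rng.standard_normal((2, 16000))
print(f"SA-SDR:    {sa_sdr(refs, ests):.1f} dB")
print(f"SA-SI-SDR: {sa_si_sdr(refs, ests):.1f} dB")
```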
Pages: 576-589
Number of pages: 14