EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR

Cited by: 0

Authors:

Chang, Xuankai ^{[1,2]}

Moritz, Niko ^{[1]}

Hori, Takaaki ^{[1]}

Watanabe, Shinji ^{[2]}

Le Roux, Jonathan ^{[1]}

Affiliations:

[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

CTC; GTC; WFST; end-to-end ASR; multispeaker overlapped speech

DOI:

10.1109/ICASSP43922.2022.9747375

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modeling becomes very similar to single-speaker ASR modeling, in that tokens from multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.


Pages: 7322-7326

Page count: 5
