EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR

Cited by: 0

Authors:

Chang, Xuankai [1,2]

Moritz, Niko [1]

Hori, Takaaki [1]

Watanabe, Shinji [2]

Le Roux, Jonathan [1]

Affiliations:

[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA

[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

CTC; GTC; WFST; end-to-end ASR; multispeaker overlapped speech;

DOI:

10.1109/ICASSP43922.2022.9747375

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modeling becomes very similar to single-speaker ASR modeling, in that tokens from multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.
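
The idea of a single merged token sequence with speaker information on the transitions can be illustrated with a toy Python sketch. This is not the authors' implementation and does not compute the GTC-e loss. The function name `merge_chronologically`, the per-token start times, and the convention that each transition carries the speaker of the token it leads to are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical input format: one list per speaker of (start_time_seconds, token).
Utterance = List[Tuple[float, str]]


def merge_chronologically(utterances: List[Utterance]) -> Tuple[List[str], List[int]]:
    """Merge per-speaker token lists into one chronological sequence.

    Returns the merged tokens (the "nodes") and, for each node, the index of
    the speaker attached to the transition entering it (an assumed convention).
    """
    tagged = [
        (start, spk, tok)
        for spk, utt in enumerate(utterances)
        for start, tok in utt
    ]
    # Sort by start time; ties are broken by speaker index for determinism.
    tagged.sort(key=lambda item: (item[0], item[1]))
    nodes = [tok for _, _, tok in tagged]
    transitions = [spk for _, spk, _ in tagged]
    return nodes, transitions


# Made-up overlapping speech from two speakers.
speaker0 = [(0.0, "hello"), (0.9, "world")]
speaker1 = [(0.4, "good"), (1.2, "morning")]

nodes, transitions = merge_chronologically([speaker0, speaker1])
print(nodes)        # merged token sequence
print(transitions)  # speaker index on the transition into each token
```

Under these assumptions, a single-output recognizer would be trained to emit the merged token sequence, while the transition labels indicate which speaker each token belongs to.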

Pages: 7322-7326

Page count: 5
