EXTENDED GRAPH TEMPORAL CLASSIFICATION FOR MULTI-SPEAKER END-TO-END ASR

Cited by: 0
Authors
Chang, Xuankai [1, 2]
Moritz, Niko [1]
Hori, Takaaki [1]
Watanabe, Shinji [2]
Le Roux, Jonathan [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA
[2] Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA 15213 USA
Keywords
CTC; GTC; WFST; end-to-end ASR; multi-speaker overlapped speech
DOI
10.1109/ICASSP43922.2022.9747375
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification (CTC) loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC that uses a neural network to model the posteriors of both labels and label transitions, which allows it to be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and the ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modeling becomes very similar to single-speaker ASR modeling, in that tokens from multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.
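The idea of recognizing tokens from several speakers as one merged, chronologically ordered sequence, with ASR tokens on the nodes and speaker identity on the transitions, can be illustrated with a small sketch. This is not the authors' implementation and it does not compute the GTC-e loss. The function names `merge_chronologically` and `split_by_speaker`, the per-token start times, and the list-based representation are all assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: one list per speaker of (start_time, token) pairs.
Utterance = List[Tuple[float, str]]


def merge_chronologically(utterances: List[Utterance]) -> Tuple[List[str], List[int]]:
    """Merge per-speaker token lists into one chronologically ordered sequence.

    Returns (nodes, transitions):
      nodes[i]       is the i-th token of the merged sequence (the "ASR output"),
      transitions[i] is the speaker index attached to the transition into nodes[i].
    """
    tagged = [
        (start, speaker, token)
        for speaker, tokens in enumerate(utterances)
        for start, token in tokens
    ]
    # Sort by start time; ties are broken by speaker index (an arbitrary assumption).
    tagged.sort(key=lambda item: (item[0], item[1]))
    nodes = [token for _, _, token in tagged]
    transitions = [speaker for _, speaker, _ in tagged]
    return nodes, transitions


def split_by_speaker(nodes: List[str], transitions: List[int], num_speakers: int) -> List[List[str]]:
    """Recover each speaker's transcription from the merged sequence."""
    per_speaker: List[List[str]] = [[] for _ in range(num_speakers)]
    for token, speaker in zip(nodes, transitions):
        per_speaker[speaker].append(token)
    return per_speaker


if __name__ == "__main__":
    speaker_0 = [(0.0, "hello"), (1.0, "world")]
    speaker_1 = [(0.5, "good"), (1.5, "day")]
    nodes, transitions = merge_chronologically([speaker_0, speaker_1])
    print(nodes)        # merged token sequence
    print(transitions)  # speaker index for each transition
    print(split_by_speaker(nodes, transitions, num_speakers=2))
```

In this toy representation a single-sequence recognizer only has to emit the merged `nodes`, while the `transitions` labels carry the information needed to separate the speakers afterwards.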
Pages: 7322-7326
Number of pages: 5