Multi-Granularity Sequence Alignment Mapping for Encoder-Decoder Based End-to-End ASR

Cited by: 1
Authors
Tang, Jian [1]
Zhang, Jie [1]
Song, Yan [1]
McLoughlin, Ian [1]
Dai, Li-Rong [1]
Affiliations
[1] University of Science and Technology of China (USTC), National Engineering Laboratory for Speech and Language Information Processing, Hefei 230026, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Speech processing; multi-granularity; sequence alignment; end-to-end ASR; encoder-decoder; post-inference; deep learning; speech recognition; models
DOI
10.1109/TASLP.2021.3101921
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Encoder-decoder based automatic speech recognition (ASR) methods are increasingly popular due to their simplified processing stages and low reliance on prior knowledge. Conventional encoder-decoder based approaches usually learn a sequence-to-sequence mapping function from the source speech to target units (e.g., subwords, characters) in an end-to-end manner. However, it is still unclear how to choose the optimal target unit, or the optimal granularity among multiple candidate units. Since increasing the information available for learning sequence-to-sequence mapping functions can generally improve modeling effectiveness, we propose a multi-granularity sequence alignment (MGSA) approach, which aims to enhance cross-sequence interactions between units of different granularities in both the modeling and inference stages of encoder-decoder based ASR. Specifically, a decoder module is designed to generate multi-granularity sequence predictions. We then exploit the latent alignment mapping among units of different granularities by feeding the decoded multi-level sequences back as input for model prediction. The cross-sequence interaction can also be employed to re-calibrate output probabilities in the proposed post-inference algorithm. Experimental results on both the WSJ (80 hrs) and Switchboard (300 hrs) datasets show the superiority of the proposed method over traditional multi-task methods as well as single-granularity baseline systems.
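To make the abstract's idea more concrete, below is a minimal, hypothetical PyTorch sketch of a decoder producing predictions at two granularities together with a toy post-inference re-calibration. It is not the authors' MGSA implementation: the two CTC-style heads, the layer sizes, the mixing weight `alpha`, and the `subword_to_chars` lexicon are all illustrative assumptions, and the actual MGSA model additionally feeds the decoded multi-level sequences back into the decoder rather than merely averaging scores.

```python
# Minimal, hypothetical sketch of the multi-granularity idea (NOT the authors'
# MGSA implementation): a shared encoder feeds two prediction heads, one per
# granularity (subwords vs. characters), and a toy post-inference step
# re-calibrates subword scores with the scores of their constituent characters.
# All sizes, names, and the subword_to_chars lexicon are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityDecoder(nn.Module):
    """Shared encoder with one output head per granularity (assumed design)."""

    def __init__(self, feat_dim=80, hidden=256, n_subwords=500, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.subword_head = nn.Linear(hidden, n_subwords)  # coarse granularity
        self.char_head = nn.Linear(hidden, n_chars)        # fine granularity

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features
        enc, _ = self.encoder(feats)
        # Per-frame log-posteriors for each granularity (CTC-style, for brevity)
        return (F.log_softmax(self.subword_head(enc), dim=-1),
                F.log_softmax(self.char_head(enc), dim=-1))


def recalibrate(subword_logp, char_logp, subword_to_chars, alpha=0.3):
    """Toy post-inference re-calibration: mix each subword's utterance-level
    score with the mean score of its constituent characters (assumed rule)."""
    char_scores = char_logp.mean(dim=1)        # (batch, n_chars)
    fused = subword_logp.mean(dim=1).clone()   # (batch, n_subwords)
    for sw, chars in subword_to_chars.items():
        fused[:, sw] = (1 - alpha) * fused[:, sw] + alpha * char_scores[:, chars].mean(dim=-1)
    return fused


if __name__ == "__main__":
    model = MultiGranularityDecoder()
    feats = torch.randn(2, 120, 80)            # two dummy utterances of 120 frames
    sw_logp, ch_logp = model(feats)
    # Hypothetical lexicon: subword id -> ids of the characters it spells
    lexicon = {0: [3, 5], 1: [7], 2: [2, 9, 4]}
    print(recalibrate(sw_logp, ch_logp, lexicon).shape)  # torch.Size([2, 500])
```

The sketch only illustrates the general pattern of sharing an encoder across granularities and letting the fine-grained (character) scores adjust the coarse-grained (subword) posteriors at inference time; the paper's actual alignment mapping and post-inference algorithm should be taken from the publication itself.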
Pages: 2816-2828
Number of Pages: 13