OPTIMIZING ALIGNMENT OF SPEECH AND LANGUAGE LATENT SPACES FOR END-TO-END SPEECH RECOGNITION AND UNDERSTANDING

Cited by: 3
Authors
Wang, Wei [1 ,2 ]
Ren, Shuo [2 ]
Qian, Yao [2 ]
Liu, Shujie [2 ]
Shi, Yu [2 ]
Qian, Yanmin [1 ]
Zeng, Michael [2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Shanghai, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
speech recognition; multi-modality; end-to-end;
DOI
10.1109/ICASSP43922.2022.9747760
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text encoder that leverages extensive text data and thus captures more context-aware linguistic information. However, this approach introduces a mismatch between the speech encoder and the text encoder because they model different units. In this paper, we propose an embedding aligner and modality switch training to better align the speech and text latent spaces. The embedding aligner is a linear projection shared between the text encoder and the speech encoder, trained with a masked language modeling (MLM) loss and a connectionist temporal classification (CTC) loss, respectively. Modality switch training randomly swaps speech and text embeddings based on the forced-alignment result to learn a joint representation space. Experimental results show that the proposed approach achieves a relative word error rate (WER) reduction of 14% to 19% on the LibriSpeech ASR task. We further verify its effectiveness on spoken language understanding (SLU), obtaining an absolute F1 score improvement of 2.5% to 2.8% on the SNIPS slot-filling task.
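The swapping step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `modality_switch`, the `swap_prob` parameter, and the choice to substitute one text embedding for a token's whole span of speech frames are all assumptions made for illustration, and the sketch assumes the forced-alignment spans cover every frame.

```python
import random


def modality_switch(speech_emb, text_emb, alignment, swap_prob=0.5, rng=None):
    """Mix speech and text embeddings token by token (illustrative only).

    speech_emb: list of frame vectors (length T)
    text_emb:   list of token vectors (length N)
    alignment:  list of N (start, end) frame spans, assumed to come from
                a forced aligner; end is exclusive
    swap_prob:  hypothetical probability of using the text embedding
    """
    rng = rng or random.Random()
    if len(text_emb) != len(alignment):
        raise ValueError("need one alignment span per text token")
    mixed = []
    for token_vec, (start, end) in zip(text_emb, alignment):
        if rng.random() < swap_prob:
            # Text modality: one vector stands in for the whole span (assumption).
            mixed.append(list(token_vec))
        else:
            # Speech modality: keep the aligned frames unchanged.
            mixed.extend(list(frame) for frame in speech_emb[start:end])
    return mixed


# Toy data: 5 speech frames, 2 tokens, 2-dimensional embeddings.
speech = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.7, 0.9]]
text = [[1.0, 0.0], [0.0, 1.0]]
spans = [(0, 2), (2, 5)]

all_text = modality_switch(speech, text, spans, swap_prob=1.0)
all_speech = modality_switch(speech, text, spans, swap_prob=0.0)
mixed = modality_switch(speech, text, spans, rng=random.Random(0))
```

With `swap_prob=1.0` the output is the text sequence, and with `swap_prob=0.0` it is the original speech sequence; intermediate values yield a sequence that mixes both modalities, which a shared downstream model would then consume.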
Pages: 7802-7806
Number of pages: 5