Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Cited by: 0
Authors
Liu, Yuchen [1, 2]
Zhang, Jiajun [1, 2]
Xiong, Hao [4]
Zhou, Long [1, 2]
He, Zhongjun [4]
Wu, Hua [4]
Wang, Haifeng [4]
Zong, Chengqing [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, 10 Shangdi 10th St, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, 10 Shangdi 10th St, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, 10 Shangdi 10th St, Beijing, Peoples R China
[4] Baidu Inc, 10 Shangdi 10th St, Beijing, Peoples R China
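Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract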
Speech-to-text translation (ST), which translates source-language speech into target-language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has the potential benefits of lower latency, smaller model size, and less error propagation. However, it is notoriously difficult to implement such a model without transcriptions as an intermediate step. Existing works generally apply multi-task learning to improve translation quality by jointly training end-to-end ST along with automatic speech recognition (ASR). However, the tasks in this method cannot utilize information from each other, which limits the improvement. Other works propose a two-stage model in which the second stage can use the hidden state from the first, but its cascaded design greatly hurts the efficiency of training and inference. In this paper, we propose a novel interactive attention mechanism that enables ASR and ST to be performed synchronously and interactively in a single model. Specifically, the generation of transcriptions and translations relies not only on each task's own previous outputs but also on the outputs predicted in the other task. Experiments on TED speech translation corpora show that our proposed model outperforms strong baselines in speech translation quality and achieves better speech recognition performance as well.
Pages: 8417-8424
Number of pages: 8