Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Cited by: 0

Authors:

Liu, Yuchen [1,2]

Zhang, Jiajun [1,2]

Xiong, Hao [4]

Zhou, Long [1,2]

He, Zhongjun [4]

Wu, Hua [4]

Wang, Haifeng [4]

Zong, Chengqing [1,2,3]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, 10 Shangdi 10th St, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, 10 Shangdi 10th St, Beijing, Peoples R China

[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, 10 Shangdi 10th St, Beijing, Peoples R China

[4] Baidu Inc, 10 Shangdi 10th St, Beijing, Peoples R China

Source:

THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2020 / Vol. 34

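Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: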

Speech-to-text translation (ST), which translates source-language speech into target-language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has the potential benefits of lower latency, smaller model size, and less error propagation. However, it is notoriously difficult to implement such a model without transcriptions as an intermediate representation. Existing works generally apply multi-task learning to improve translation quality by jointly training end-to-end ST along with automatic speech recognition (ASR). However, the different tasks in this method cannot utilize information from each other, which limits the improvement. Other works propose a two-stage model in which the second stage can use the hidden states of the first, but its cascaded design greatly reduces the efficiency of training and inference. In this paper, we propose a novel interactive attention mechanism that enables ASR and ST to be performed synchronously and interactively in a single model. Specifically, the generation of transcriptions and translations relies not only on each task's own previous outputs but also on the outputs predicted in the other task. Experiments on TED speech translation corpora show that our proposed model can outperform strong baselines on speech translation quality and achieve better speech recognition performance as well.
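
The idea that each task conditions on both its own history and the other task's history can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `interactive_attention`), the additive fusion, and the weight `lam` are all assumptions made for illustration, and the vectors are hand-picked stand-ins for decoder states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def interactive_attention(query, own_history, other_history, lam=0.5):
    """Hypothetical fusion: attention over this task's previous outputs
    plus lam times attention over the other task's previous outputs."""
    h_self = attend(query, own_history, own_history)
    h_inter = attend(query, other_history, other_history)
    return [s + lam * o for s, o in zip(h_self, h_inter)]


# Toy states: two previous ASR outputs and one previous ST output.
asr_history = [[1.0, 0.0], [0.0, 1.0]]
st_history = [[0.5, 0.5]]
query = [1.0, 0.0]

# The ASR step sees its own history and the ST history in the same step.
fused = interactive_attention(query, asr_history, st_history, lam=0.5)
print(fused)
```

Setting `lam=0.0` in this sketch reduces each decoder to ordinary self-attention over its own history, which corresponds to the non-interactive multi-task setting that the abstract contrasts against.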


Pages: 8417-8424

Number of pages: 8

