Streaming Models for Joint Speech Recognition and Translation

Authors
Weller, Orion [1]
Sperber, Matthias [2]
Gollan, Christian [2]
Kluivers, Joris [2]
Affiliations
[1] Brigham Young Univ, Provo, UT 84602 USA
[2] Apple, Cupertino, CA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap, recent work has shown initial progress into the feasibility for end-to-end models to produce both of these outputs. However, all previous work has only looked at this problem from the consecutive perspective, leaving uncertainty on whether these approaches are effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches. We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders. Our evaluation across a range of metrics capturing accuracy, latency, and consistency shows that our end-to-end models are statistically similar to cascading models, while having half the number of parameters. We also find that both systems provide strong translation quality at low latency, keeping 99% of consecutive quality at a lag of just under a second.
Pages: 2533-2539
Number of pages: 7