VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation

Cited: 0
Authors
Wang, Tianrui [1 ,2 ]
Zhou, Long [3 ]
Zhang, Ziqiang [4 ]
Wu, Yu [3 ]
Liu, Shujie [3 ]
Gaur, Yashesh [5 ]
Chen, Zhuo [5 ]
Li, Jinyu [5 ]
Wei, Furu [3 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Jiaotong Univ, Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Microsoft Res Asia, Beijing 100080, Peoples R China
[4] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei 230026, Peoples R China
[5] Microsoft Corp, Redmond, WA 98052 USA
Keywords
Task analysis; Speech recognition; Codecs; Acoustics; Speech processing; Speech coding; Semantics; Language model; speech recognition; machine translation; speech synthesis; speech translation
DOI
10.1109/TASLP.2024.3434425
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Recent research shows a strong convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language model task via a multi-task learning framework. To accomplish this, we first convert the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID), language IDs (LID), and LSTM-based acoustic embedding into the proposed model to enhance its capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and the decoder-only model achieves comparable or even better performance than strong baselines.
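The core idea the abstract describes, turning every speech/text task into one token-sequence prediction problem prefixed with task-ID and language-ID tokens, can be sketched as below. This is a conceptual illustration only, not the authors' code: the token ID values, the vocabularies, and the `codec_encode` stand-in are all hypothetical; a real system would use an actual neural codec (e.g. EnCodec-style discrete codes) and a trained decoder-only Transformer.

```python
# Conceptual sketch of VioLA-style task unification (hypothetical IDs).
TASK_IDS = {"asr": 0, "mt": 1, "tts": 2, "st": 3}   # task-ID (TID) tokens
LANG_IDS = {"en": 10, "zh": 11}                      # language-ID (LID) tokens

def codec_encode(waveform):
    """Stand-in for an offline neural codec encoder that maps a raw
    waveform to discrete acoustic tokens. Real codecs emit codebook
    indices; here we just fabricate deterministic integer tokens."""
    return [100 + (int(x * 1000) % 50) for x in waveform]

def build_sequence(task, src_lang, tgt_lang, source, target_prefix=()):
    """Flatten any speech/text task into one token sequence:
    [TID, source LID, source tokens, target LID, target tokens...].
    A decoder-only LM is then trained to continue such sequences,
    so ASR, MT, TTS, and S2S all share one conditional LM."""
    if task in ("asr", "st"):           # speech input -> codec tokens
        src_tokens = codec_encode(source)
    else:                               # text input -> text token IDs
        src_tokens = list(source)
    return ([TASK_IDS[task], LANG_IDS[src_lang]] + src_tokens
            + [LANG_IDS[tgt_lang]] + list(target_prefix))

# Speech recognition: condition on acoustic tokens, predict text tokens.
seq = build_sequence("asr", "en", "en", source=[0.1, -0.2, 0.3])
print(seq[:2])  # prints [0, 10]: task ID, then source language ID
```

The point of the sketch is the sequence layout: once every modality is tokenized, the only task-specific machinery left is the TID/LID prefix, which is what lets a single decoder-only model serve all four task families.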
Pages: 3709-3716 (8 pages)
Related papers
50 items
  • [1] Streaming Models for Joint Speech Recognition and Translation
    Weller, Orion
    Sperber, Matthias
    Gollan, Christian
    Kluivers, Joris
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2533 - 2539
  • [2] Language Models for Tamil Speech Recognition System
    Saraswathi, S.
    Geetha, T. V.
    [J]. IETE TECHNICAL REVIEW, 2007, 24 (05) : 375 - 383
  • [3] Gaussian mixture language models for speech recognition
    Afify, Mohamed
    Siohan, Olivier
    Sarikaya, Ruhi
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 29 - +
  • [4] Improving language models for radiology speech recognition
    Paulett, John M.
    Langlotz, Curtis P.
    [J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2009, 42 (01) : 53 - 58
  • [5] Discriminative training of language models for speech recognition
    Kuo, HKJ
    Fosler-Lussier, E
    Jiang, H
    Lee, CH
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 325 - 328
  • [6] BAYESIAN TRANSFORMER LANGUAGE MODELS FOR SPEECH RECOGNITION
    Xue, Boyang
    Yu, Jianwei
    Xu, Junhao
    Liu, Shansong
    Hu, Shoukang
    Ye, Zi
    Geng, Mengzhe
    Liu, Xunying
    Meng, Helen
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7378 - 7382
  • [7] Dirichlet Class Language Models for Speech Recognition
    Chien, Jen-Tzung
    Chueh, Chuang-Hua
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (03): : 482 - 495
  • [8] GEOGRAPHIC LANGUAGE MODELS FOR AUTOMATIC SPEECH RECOGNITION
    Xiao, Xiaoqiang
    Chen, Hong
    Zylak, Mark
    Sosa, Daniela
    Desu, Suma
    Krishnamoorthy, Mahesh
    Liu, Daben
    Paulik, Matthias
    Zhang, Yuchen
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6124 - 6128
  • [9] Syntactic Reanalysis in Language Models for Speech Recognition
    Twiefel, Johannes
    Hinaut, Xavier
    Wermter, Stefan
    [J]. 2017 THE SEVENTH JOINT IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING AND EPIGENETIC ROBOTICS (ICDL-EPIROB), 2017, : 215 - 220
  • [10] Factored Translation Models for improving a Speech into Sign Language Translation System
    Lopez-Ludena, V.
    San-Segundo, R.
    Cordoba, R.
    Ferreiros, J.
    Montero, J. M.
    Pardo, J. M.
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 1616 - 1619