VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation

Authors
Wang, Tianrui [1 ,2 ]
Zhou, Long [3 ]
Zhang, Ziqiang [4 ]
Wu, Yu [3 ]
Liu, Shujie [3 ]
Gaur, Yashesh [5 ]
Chen, Zhuo [5 ]
Li, Jinyu [5 ]
Wei, Furu [3 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Jiaotong Univ, Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Microsoft Res Asia, Beijing 100080, Peoples R China
[4] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei 230026, Peoples R China
[5] Microsoft Corp, Redmond, WA 98052 USA
Keywords
Task analysis; Speech recognition; Codecs; Acoustics; Speech processing; Speech coding; Semantics; Language model; speech recognition; machine translation; speech synthesis; speech translation
DOI
10.1109/TASLP.2024.3434425
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language model task via a multi-task learning framework. To accomplish this, we first convert the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID), language IDs (LID), and LSTM-based acoustic embedding into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VioLA model can support both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, or even better than, that of strong baselines.
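The idea of casting every task as one token sequence can be illustrated with a minimal sketch. It is not the authors' implementation: the token spellings (`<tid:asr>`, `<lid:en>`, `<c17>`), the ordering of the fields, and the helper names `build_sequence` and `loss_mask` are all assumptions made for illustration. The abstract states only that speech is discretized by a codec, that TIDs and LIDs are added, and that the model predicts tokens conditionally.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; the abstract does not specify their form.
TASK_IDS = {"asr": "<tid:asr>", "mt": "<tid:mt>", "tts": "<tid:tts>", "s2st": "<tid:s2st>"}
LANG_IDS = {"en": "<lid:en>", "zh": "<lid:zh>"}
EOS = "<eos>"


def build_sequence(
    task: str,
    src_lang: str,
    tgt_lang: str,
    src_tokens: Sequence[str],
    tgt_tokens: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Return (condition prefix, full sequence) for one example.

    Source and target may each be text tokens or discrete codec tokens;
    once speech is discretized, both are plain token lists.
    The field order used here is an assumed layout, not the paper's.
    """
    prefix = [LANG_IDS[src_lang], *src_tokens, TASK_IDS[task], LANG_IDS[tgt_lang]]
    full = prefix + list(tgt_tokens) + [EOS]
    return prefix, full


def loss_mask(prefix: Sequence[str], full: Sequence[str]) -> List[int]:
    """1 where the language-model loss applies (target part), 0 on the condition."""
    return [0] * len(prefix) + [1] * (len(full) - len(prefix))


# Speech recognition: codec tokens in, text tokens out.
asr_prefix, asr_full = build_sequence(
    "asr", "en", "en", ["<c17>", "<c402>", "<c9>"], ["hello", "world"]
)
# Speech synthesis: the same layout with the modalities swapped.
tts_prefix, tts_full = build_sequence(
    "tts", "en", "en", ["hello", "world"], ["<c17>", "<c402>", "<c9>"]
)
asr_mask = loss_mask(asr_prefix, asr_full)
```

Under these assumptions, recognition and synthesis differ only in which side holds codec tokens and which task ID is used, which is what lets a single decoder-only model serve both.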
Pages: 3709-3716
Number of pages: 8