Cascaded encoders for fine-tuning ASR models on overlapped speech

Cited by: 0
Authors
Rose, Richard [1 ]
Chang, Oscar [1 ]
Siohan, Olivier [1 ]
Affiliations
[1] Google Inc, New York, NY 10011 USA
Source
INTERSPEECH 2023
Keywords
multi-talker speech recognition;
DOI
10.21437/Interspeech.2023-1952
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Multi-talker automatic speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. While MT-ASR models have typically been trained from scratch on simulated overlapping speech datasets, there is generally an underlying goal that these models also achieve state-of-the-art performance on single-speaker utterances. This implies that they must be competitive with the best available fine-tuned speech models, which have been trained on massive datasets collected from a wide variety of task domains. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascaded configuration improves WER on overlapping speech utterances relative to a baseline multi-talker model, with minimal impact on the performance achievable by the foundation model on non-overlapping utterances.
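The cascaded configuration described in the abstract stacks a second, multi-talker encoder together with a pre-trained foundation encoder, with the resulting representations feeding the RNN-T decoder. Below is a minimal PyTorch sketch of that idea, not the paper's actual implementation: the layer types, dimensions, class names, encoder ordering, and the choice to freeze the foundation encoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class CascadedMTEncoder(nn.Module):
    """Sketch of a cascaded encoder for multi-talker ASR: a frozen,
    well-trained foundation encoder followed by a small trainable
    "mask" encoder that adapts it to overlapped speech. All module
    choices and dimensions here are illustrative assumptions."""

    def __init__(self, feat_dim=80, enc_dim=512, mask_layers=4):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, enc_dim)
        # Stand-in for a large pre-trained foundation encoder
        # (the real model would be e.g. a Conformer stack).
        self.foundation = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=8)
        # Freeze the foundation path so single-speaker performance is
        # preserved (an assumption about the training recipe, not a
        # claim about the paper's exact fine-tuning strategy).
        for p in list(self.frontend.parameters()) + \
                 list(self.foundation.parameters()):
            p.requires_grad = False
        # Trainable cascaded encoder fine-tuned on overlapped speech.
        self.mask_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=mask_layers)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) filterbank features.
        h = self.foundation(self.frontend(feats))
        # The cascaded output would feed the RNN-T prediction and
        # joint networks (omitted here for brevity).
        return self.mask_encoder(h)


if __name__ == "__main__":
    enc = CascadedMTEncoder()
    x = torch.randn(2, 100, 80)  # two utterances, 100 frames each
    print(enc(x).shape)          # torch.Size([2, 100, 512])
```

Under these assumptions, only the cascaded mask encoder receives gradients during multi-talker fine-tuning, which is one plausible way to obtain the "minimal impact" on single-speaker performance that the abstract reports.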
Pages: 3457-3461
Page count: 5
Related Papers
50 items in total
  • [1] On Surgical Fine-tuning for Language Encoders. Lodha, Abhilasha; Belapurkar, Gayatri; Chalkapurkar, Saloni; Tao, Yuanming; Ghosh, Reshmi; Basu, Samyadeep; Petrov, Dmitrii; Srinivasan, Soundararajan. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023: 3105-3113.
  • [2] Pretrained Speech Encoders and Efficient Fine-tuning Methods for Speech Translation: UPC at IWSLT 2022. Tsiamas, Ioannis; Gallego, Gerard I.; Escolano, Carlos; Fonollosa, Jose A. R.; Costa-jussa, Marta R. PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION (IWSLT 2022), 2022: 265-276.
  • [3] Improving Speech Emotion Recognition via Fine-tuning ASR with Speaker Information. Ta, Bao Thang; Nguyen, Tung Lam; Dang, Dinh Son; Le, Nhat Minh; Do, Van Hai. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022: 1596-1601.
  • [4] Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation. Mujtaba, Dena; Mahapatra, Nihar R.; Arne, Megan; Yaruss, J. Scott; Herring, Caryn; Bin, Jia. INTERSPEECH 2024, 2024: 1275-1279.
  • [5] Fine-Tuning ASR models for Very Low-Resource Languages: A Study on Mvskoke. Mainzinger, Julia; Levow, Gina-Anne. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024: 94-100.
  • [6] CHINESE ASR AND NER IMPROVEMENT BASED ON WHISPER FINE-TUNING. Yang, Hao; Zhang, Min; Tao, Shimin; Ma, Miaomiao; Qin, Ying. 2023 25TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY, ICACT, 2023: 213-217.
  • [7] SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT. Huang, Wen-Chin; Wu, Chia-Hua; Luo, Shang-Bao; Chen, Kuan-Yu; Wang, Hsin-Min; Toda, Tomoki. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 7343-7347.
  • [8] Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement. Yang, Hejung; Kang, Hong-Goo. INTERSPEECH 2023, 2023: 814-818.
  • [9] Fine-tuning pretrained transformer encoders for sequence-to-sequence learning. Bao, Hangbo; Dong, Li; Wang, Wenhui; Yang, Nan; Piao, Songhao; Wei, Furu. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024, 15(05): 1711-1728.
  • [10] Fine-tuning constraints on supergravity models. Bastero-Gil, M.; Kane, G. L.; King, S. F. PHYSICS LETTERS B, 2000, 474(1-2): 103-112.