Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Cited by: 1
Authors
Deng, Jiajun [1]
Xie, Xurong [2]
Wang, Tianzi [1]
Cui, Mingyu [1]
Xue, Boyang [1]
Jin, Zengrui [1]
Li, Guinan [1]
Hu, Shujie [1]
Liu, Xunying [1]
Affiliations
[1] Chinese Univ Hong Kong, Cent Ave, Hong Kong, Peoples R China
[2] Chinese Acad Sci, Inst Software, Beijing 100045, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Adaptation models; Hidden Markov models; Data models; Acoustics; Transformers; Switches; Task analysis; Speech recognition; speaker adaptation; confidence score estimation; Bayesian learning; conformer; NEURAL-NETWORK; TRANSFORMATIONS; NORMALIZATION; TRANSCRIPTION
DOI
10.1109/TASLP.2023.3250842
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.
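The confidence score-based data selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance` record, the `select_adaptation_data` helper, the utterance-level granularity, and the 0.8 threshold are all hypothetical choices made for illustration, and the confidence values are assumed to come from some separate estimator that is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """Hypothetical record for one utterance of a speaker's adaptation data."""
    utt_id: str
    hypothesis: str    # first-pass ASR output, used as unsupervised supervision
    confidence: float  # assumed utterance-level confidence score in [0, 1]


def select_adaptation_data(utterances: List[Utterance],
                           threshold: float = 0.8) -> List[Utterance]:
    """Keep only utterances whose confidence is at or above the threshold.

    The threshold value is an illustrative assumption, not a figure
    taken from the paper.
    """
    return [u for u in utterances if u.confidence >= threshold]


if __name__ == "__main__":
    speaker_data = [
        Utterance("utt1", "hello there", 0.95),
        Utterance("utt2", "helo their", 0.40),
        Utterance("utt3", "good morning", 0.80),
    ]
    # Only the higher-confidence hypotheses would be passed on to adaptation.
    for utt in select_adaptation_data(speaker_data):
        print(utt.utt_id, utt.confidence)
```

Filtering in this way shrinks the already limited per-speaker data, which is the sparsity problem the abstract says is handled by Bayesian modelling of the SD parameters.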
Pages: 1175-1190
Number of pages: 16
Related papers
50 in total
  • [1] Confidence Score Based Conformer Speaker Adaptation for Speech Recognition
    Deng, Jiajun
    Xie, Xurong
    Wang, Tianzi
    Cui, Mingyu
    Xue, Boyang
    Jin, Zengrui
    Geng, Mengzhe
    Li, Guinan
    Liu, Xunying
    Meng, Helen
    [J]. INTERSPEECH 2022, 2022, : 2623 - 2627
  • [2] SPEAKER MODEL ADAPTATION BASED ON CONFIDENCE SCORE
    Mengusoglu, Erhan
    [J]. TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2015, 22 (04): : 873 - 878
  • [3] A confidence-score based unsupervised map adaptation for speech recognition
    Wang, DG
    Narayanan, SS
    [J]. THIRTY-SIXTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS - CONFERENCE RECORD, VOLS 1 AND 2, CONFERENCE RECORD, 2002, : 222 - 226
  • [4] Speaker clustering and transformation for speaker adaptation in speech recognition systems
    Padmanabhan, M
    Bahl, LR
    Nahamoo, D
    Picheny, MA
    [J]. IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1998, 6 (01): : 71 - 77
  • [5] Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems
    Siniscalchi, Sabato Marco
    Li, Jinyu
    Lee, Chin-Hui
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (10): : 2152 - 2161
  • [6] Speaker adaptation for hybrid MMI/connectionist speech recognition systems
    Rottland, J
    Neukirchen, C
    Rigoll, G
    [J]. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 465 - 468
  • [7] Discriminative speaker adaptation in Persian continuous speech recognition systems
    Pirhosseinloo, Shadi
    Ganj, Farshad Almas
    [J]. 4TH INTERNATIONAL CONFERENCE OF COGNITIVE SCIENCE, 2012, 32 : 296 - 301
  • [8] PREDICTIVE SPEAKER ADAPTATION IN SPEECH RECOGNITION
    COX, S
    [J]. COMPUTER SPEECH AND LANGUAGE, 1995, 9 (01): : 1 - 17
  • [9] Automatic confidence score mapping for adapted speech recognition systems
    Sankar, A
    Kannan, A
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 213 - 216
  • [10] Confidence-driven iterative speaker adaptation in transcription-mode speech recognition
    Ou, JZ
    Chen, KJ
    Li, ZG
    [J]. 2001 INTERNATIONAL CONFERENCES ON INFO-TECH AND INFO-NET PROCEEDINGS, CONFERENCE A-G: INFO-TECH & INFO-NET: A KEY TO BETTER LIFE, 2001, : B665 - B670