Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

Cited by: 0
Authors
Gimeno-Gomez, David [1]
Martinez-Hinarejos, Carlos-D. [1]
Affiliations
[1] Univ Politecn Valencia, Pattern Recognit & Human Language Technol Res Ctr, Camino Vera S-N, Valencia 46022, Spain
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 11
Keywords
visual speech recognition; speaker adaptation; fine-tuning; Adapters; Spanish language; end-to-end architectures;
DOI
10.3390/app13116521
Chinese Library Classification
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. Although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the intra-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. To this end, we propose two adaptation methods: one based on the conventional fine-tuning technique and one based on the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, reducing the training time and storage cost by up to 80%.
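The abstract does not describe the internal structure of the Adapters, so the following is only a minimal sketch under an assumption: that an Adapter takes the commonly used bottleneck form (down-projection, ReLU, up-projection, residual connection) inserted into a frozen model. The function names and the layer sizes are hypothetical and are not taken from the paper. The sketch illustrates why storing one small Adapter per speaker may be cheaper than storing a fully fine-tuned copy of the model.

```python
# Minimal sketch of a bottleneck adapter, in plain Python.
# ASSUMPTION: the bottleneck form below is a common adapter design; the
# abstract does not confirm that the authors use exactly this structure.

def linear_params(n_in, n_out):
    """Number of parameters of a dense layer: weights plus biases."""
    return n_in * n_out + n_out


def adapter_params(d_model, bottleneck):
    """Parameters of one adapter: down-projection plus up-projection."""
    return linear_params(d_model, bottleneck) + linear_params(bottleneck, d_model)


def adapter_forward(x, w_down, b_down, w_up, b_up):
    """Apply y = x + up(relu(down(x))) to a single feature vector x.

    w_down holds one weight list (length d_model) per bottleneck unit;
    w_up holds one weight list (length bottleneck) per output unit.
    """
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, unit)) + b)
        for unit, b in zip(w_down, b_down)
    ]
    delta = [
        sum(hi * w for hi, w in zip(hidden, unit)) + b
        for unit, b in zip(w_up, b_up)
    ]
    # Residual connection: the frozen model's features pass through unchanged
    # and the adapter only adds a speaker-specific correction.
    return [xi + di for xi, di in zip(x, delta)]


# Hypothetical sizes, chosen only for illustration.
D_MODEL, BOTTLENECK = 256, 32
per_adapter = adapter_params(D_MODEL, BOTTLENECK)
per_dense_layer = linear_params(D_MODEL, D_MODEL)

# With a zero up-projection the adapter is the identity mapping.
identity_out = adapter_forward([1.0, 2.0], [[0.5, -0.5]], [0.0],
                               [[0.0], [0.0]], [0.0, 0.0])
shifted_out = adapter_forward([2.0, 1.0], [[0.5, -0.5]], [0.0],
                              [[1.0], [2.0]], [0.0, 0.0])
```

In this sketch only the adapter weights would be trained and stored per speaker, while fine-tuning would update and store every weight of the model. How much this saves in practice depends on the real architecture; the abstract reports reductions of up to 80% in training time and storage cost.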
Pages: 16