Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

Cited by: 0
Authors
Gimeno-Gomez, David [1]
Martinez-Hinarejos, Carlos-D. [1]
Affiliations
[1] Univ Politecn Valencia, Pattern Recognit & Human Language Technol Res Ctr, Camino Vera S-N, Valencia 46022, Spain
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 11
Keywords
visual speech recognition; speaker adaptation; fine-tuning; Adapters; Spanish language; end-to-end architectures;
DOI
10.3390/app13116521
Chinese Library Classification
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods, one based on the conventional fine-tuning technique and one based on the so-called Adapters. We conduct a comparative study in terms of performance while considering different deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods are able to obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method presents a more scalable and efficient solution, significantly reducing the training time and storage cost by up to 80%.
Pages: 16