Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

Cited by: 6
Authors
Yu, Chongchong [1 ]
Su, Xiaosu [1 ]
Qian, Zhaopeng [1 ]
Affiliation
[1] Beijing Technology & Business University, School of Artificial Intelligence, Beijing 100048, People's Republic of China
Keywords
Dysarthric speech recognition; pre-training and fine-tuning; multi-stage audio-visual fusion; speaker adaptation; patterns; lips
DOI
10.1109/TNSRE.2023.3262001
CLC Number
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained on sufficient data. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode articulatory motion from all facial speech-function areas, in contrast to traditional audio-visual fusion frameworks that rely on lip movement alone. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech; the knowledge gained by the pre-trained model mitigates overfitting. Experiments on the UASpeech corpus were designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method on moderate dysarthric speech is 13.5% lower. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, our method reaches a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method thus effectively reduces the WER of dysarthric speech recognition.
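The two-stage design described in the abstract can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the module names (FaceMotionEncoder, MAVFusionModel), all layer sizes, and the plain Transformer standing in for the pre-trained backbone are assumptions, not the authors' implementation; in the paper, stage two would load and fine-tune AV-HuBERT weights rather than train a fresh encoder.

```python
# Minimal sketch of the two-stage audio-visual fusion, assuming PyTorch.
# Names and hyperparameters are hypothetical; audio and video streams are
# assumed to be pre-aligned to the same frame rate T.
import torch
import torch.nn as nn

class FaceMotionEncoder(nn.Module):
    """Stage 1: 3D CNN over full-face frames (all facial speech-function
    areas), rather than the usual lip-only crop."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):                 # frames: (B, 1, T, H, W)
        feats = self.conv(frames)              # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(feats)                # (B, T, embed_dim)

class MAVFusionModel(nn.Module):
    """Stage 2: concatenate the audio and visual streams frame by frame
    and feed them to a Transformer encoder (a stand-in here for the
    pre-trained AV-HuBERT backbone that would be fine-tuned)."""
    def __init__(self, audio_dim=80, embed_dim=256, vocab_size=1000):
        super().__init__()
        self.visual_encoder = FaceMotionEncoder(embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.fusion_proj = nn.Linear(2 * embed_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(embed_dim, vocab_size)  # logits for, e.g., a CTC loss

    def forward(self, audio_feats, frames):    # audio_feats: (B, T, audio_dim)
        v = self.visual_encoder(frames)        # (B, T, D)
        a = self.audio_proj(audio_feats)       # (B, T, D)
        fused = self.fusion_proj(torch.cat([a, v], dim=-1))
        return self.head(self.backbone(fused)) # (B, T, vocab_size)

# Smoke test with dummy tensors: 2 clips, 50 frames, 88x88 face crops.
model = MAVFusionModel()
audio = torch.randn(2, 50, 80)
video = torch.randn(2, 1, 50, 88, 88)
print(model(audio, video).shape)  # torch.Size([2, 50, 1000])
```

In the actual framework, the stage-one encoder would be trained first and reused, and the stage-two backbone would be initialized from a publicly released AV-HuBERT checkpoint before fine-tuning on UASpeech.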
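All the numbers above are word error rates. For reference, a minimal, self-contained WER implementation is shown below (standard word-level Levenshtein distance normalized by reference length); it is a sketch of the metric, not the paper's scoring script.

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed via dynamic-programming edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i            # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j            # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution + one insertion against a 4-word reference -> 0.5
print(wer("the quick brown fox", "the quick brawn fox box"))
```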
Pages: 1912-1921
Number of pages: 10
Related Papers
50 in total
  • [31] Wang, Xingmei; Mi, Jiachen; Li, Boquan; Zhao, Yixu; Meng, Jiaxiang. CATNet: Cross-modal fusion for audio-visual speech recognition. Pattern Recognition Letters, 2024, 178: 216-222.
  • [32] Pao, Tsang-Long; Liao, Wen-Yuan. An audio-visual speech recognition system for testing new audio-visual databases. VISAPP 2006: Proceedings of the First International Conference on Computer Vision Theory and Applications, Vol. 2, 2006: 192+.
  • [33] Zhang, Zi-Qiang; Zhang, Jie; Zhang, Jian-Shu; Wu, Ming-Hui; Fang, Xin; Dai, Li-Rong. Learning contextually fused audio-visual representations for audio-visual speech recognition. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 1346-1350.
  • [34] Palecek, Karel; Chaloupka, Josef. Audio-visual speech recognition in noisy audio environments. 2013 36th International Conference on Telecommunications and Signal Processing (TSP), 2013: 484-487.
  • [35] Dupont, Stephane; Luettin, Juergen. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2000, 2(3): 141-151.
  • [36] Chu, Stephen M.; Huang, Thomas S. Audio-visual speech fusion using coupled hidden Markov models. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007: 3911+.
  • [37] Yu, Jianwei; Zhang, Shi-Xiong; Wu, Bo; Liu, Shansong; Hu, Shoukang; Geng, Mengzhe; Liu, Xunying; Meng, Helen; Yu, Dong. Audio-visual multi-channel integration and recognition of overlapped speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2067-2082.
  • [38] Tamura, Satoshi; Ishikawa, Masato; Hashiba, Takashi; Takeuchi, Shin'ichi; Hayamizu, Satoru. A robust audio-visual speech recognition using audio-visual voice activity detection. Interspeech 2010 (11th Annual Conference of the International Speech Communication Association), 2010: 2702+.
  • [39] Li, Bin; Weng, Yixuan; Sun, Bin; Li, Shutao. A multi-tasking and multi-stage Chinese minority pre-trained language model. Communications in Computer and Information Science, 2022, 1671 CCIS: 93-105.
  • [40] Lv, Guoyun; Jiang, Dongmei; Zhao, Rongchun; Hou, Yunshu. Multi-stream asynchrony modeling for audio-visual speech recognition. ISM 2007: Ninth IEEE International Symposium on Multimedia, 2007: 37-44.