Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

被引:6
|
作者
Yu, Chongchong [1 ]
Su, Xiaosu [1 ]
Qian, Zhaopeng [1 ]
机构
[1] Beijing Technol & Business Univ, Sch Artificial Intelligence, Beijing 100048, Peoples R China
关键词
Dysarthric speech recognition; pre-training and fine-tuning; multi-stage audio-visual fusion; SPEAKER ADAPTATION; PATTERNS; LIPS;
D O I
10.1109/TNSRE.2023.3262001
中图分类号
R318 [生物医学工程];
学科分类号
0831 ;
摘要
Dysarthric speech recognition helps speakers with dysarthria to enjoy better communication. However, collecting dysarthric speech is difficult. The machine learning models cannot be trained sufficiently using dysarthric speech. To further improve the accuracy of dysarthric speech recognition, we proposed a Multi-stage AV-HuBERT (MAV-HuBERT) framework by fusing the visual information and acoustic information of the dysarthric speech. During the first stage, we proposed to use convolutional neural networks model to encode the motor information by incorporating all facial speech function areas. This operation is different from the traditional approach solely based on the movement of lip in audio-visual fusion framework. During the second stage, we proposed to use the AV-HuBERT framework to pre-train the recognition architecture of fusing audio and visual information of the dysarthric speech. The knowledge gained by the pre-trained model is applied to address the overfitting problem of the model. The experiments based on UASpeech are designed to evaluate our proposed method. Compared with the results of the baseline method, the best word error rate (WER) of our proposed method was reduced by 13.5% on moderate dysarthric speech. In addition, for the mild dysarthric speech, our proposed method shows the best result that the WER of our proposed method arrives at 6.05%. Even for the extremely severe dysarthric speech, the WER of our proposed method achieves at 63.98%, which reduces by 2.72% and 4.02% compared with the WERs of wav2vec and HuBERT, respectively. The proposed method can effectively further reduce the WER of the dysarthric speech.
引用
下载
收藏
页码:1912 / 1921
页数:10
相关论文
共 50 条
  • [21] MULTIPOSE AUDIO-VISUAL SPEECH RECOGNITION
    Estellers, Virginia
    Thiran, Jean-Philippe
    19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011, : 1065 - 1069
  • [22] Audio-visual integration for speech recognition
    Kober, R
    Harz, U
    NEUROLOGY PSYCHIATRY AND BRAIN RESEARCH, 1996, 4 (04) : 179 - 184
  • [23] Multi-pose lipreading and audio-visual speech recognition
    Estellers, Virginia
    Thiran, Jean-Philippe
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2012, : 1 - 23
  • [24] Audio-visual speech recognition by speechreading
    Zhang, XZ
    Mersereau, RM
    Clements, MA
    DSP 2002: 14TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, 2002, : 1069 - 1072
  • [25] Audio-visual Multi-channel Recognition of Overlapped Speech
    Yu, Jianwei
    Wu, Bo
    Gu, Rongzhi
    Zhang, Shi-Xiong
    Chen, Lianwu
    Xu, Yong
    Yu, Meng
    Su, Dan
    Yu, Dong
    Liu, Xunying
    Meng, Helen
    INTERSPEECH 2020, 2020, : 3496 - 3500
  • [26] Multi-pose lipreading and audio-visual speech recognition
    Virginia Estellers
    Jean-Philippe Thiran
    EURASIP Journal on Advances in Signal Processing, 2012
  • [27] Optimum integration weight for decision fusion audio-visual speech recognition
    Rajavel, R.
    Sathidevi, P. S.
    INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2015, 10 (1-2) : 145 - 154
  • [28] Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion
    Tariquzzaman, Md
    Gyu, Song Min
    Young, Kim Jin
    You, Na Seung
    Rashid, M. A.
    2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL III, 2010, : 216 - 219
  • [29] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022, : 1988 - 1992
  • [30] Decision Level Fusion for Audio-Visual Speech Recognition in Noisy Conditions
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2016, 2017, 10125 : 360 - 367