Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

Cited by: 6
Authors
Yu, Chongchong [1 ]
Su, Xiaosu [1 ]
Qian, Zhaopeng [1 ]
Affiliations
[1] Beijing Technol & Business Univ, Sch Artificial Intelligence, Beijing 100048, Peoples R China
Keywords
Dysarthric speech recognition; pre-training and fine-tuning; multi-stage audio-visual fusion; speaker adaptation; patterns; lips
DOI
10.1109/TNSRE.2023.3262001
Chinese Library Classification (CLC)
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained sufficiently on it. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode motor information from all facial speech-function areas, unlike the traditional audio-visual fusion approach, which relies solely on lip movement. In the second stage, we use the AV-HuBERT framework to pre-train the architecture that fuses the audio and visual information of dysarthric speech; the knowledge gained by the pre-trained model is then applied to mitigate overfitting. Experiments on UASpeech were designed to evaluate the proposed method. Compared with the baseline, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieves the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, our method reaches a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method thus effectively reduces the WER for dysarthric speech.
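To make the two-stage idea in the abstract concrete, below is a minimal PyTorch sketch: a small 3D CNN encodes motion over the whole facial region (stage one) and its frame features are fused with audio features into a Transformer backbone standing in for the pre-trained AV-HuBERT model (stage two). All module names, dimensions, and the additive fusion scheme are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage audio-visual fusion idea.
# Everything here (layer sizes, additive fusion, 80-dim log-Mel input)
# is an assumption for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class FacialMotionEncoder(nn.Module):
    """Stage 1 (assumed): a 3D CNN over the whole facial speech region,
    rather than only a lip crop."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, video):                 # video: (B, 3, T, H, W)
        feats = self.conv(video)               # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(feats)                # (B, T, out_dim)

class AudioVisualFusion(nn.Module):
    """Stage 2 (assumed): frame-wise additive fusion of audio and visual
    features into a Transformer encoder, standing in for the pre-trained
    AV-HuBERT backbone that would be fine-tuned on dysarthric speech."""
    def __init__(self, dim: int = 256, vocab: int = 1000):
        super().__init__()
        self.visual = FacialMotionEncoder(dim)
        self.audio_proj = nn.Linear(80, dim)    # e.g. 80-dim log-Mel frames
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)       # per-frame token logits

    def forward(self, audio, video):            # audio: (B, T, 80)
        fused = self.audio_proj(audio) + self.visual(video)
        return self.head(self.backbone(fused))  # (B, T, vocab)

model = AudioVisualFusion()
logits = model(torch.randn(2, 50, 80), torch.randn(2, 3, 50, 96, 96))
print(logits.shape)  # torch.Size([2, 50, 1000])
```

In practice the backbone weights would come from AV-HuBERT pre-training and only be fine-tuned on the small dysarthric corpus, which is how the paper addresses the overfitting problem it describes.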
Pages: 1912-1921
Page count: 10
Related Papers
50 records in total (first 10 shown)
  • [1] A Pre-Trained Audio-Visual Transformer for Emotion Recognition
    Tran, Minh
    Soleymani, Mohammad
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4698 - 4702
  • [2] Bimodal fusion in audio-visual speech recognition
    Zhang, XZ
    Mersereau, RM
    Clements, M
    2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967
  • [3] Audio-visual fuzzy fusion for robust speech recognition
    Malcangi, M.
    Ouazzane, K.
    Patel, P.
    2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,
  • [4] Weighting schemes for audio-visual fusion in speech recognition
    Glotin, H
    Vergyri, D
    Neti, C
    Potamianos, G
    Luettin, J
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2001, : 173 - 176
  • [5] Multistage information fusion for audio-visual speech recognition
    Chu, SM
    Libal, V
    Marcheret, E
    Neti, C
    Potamianos, G
    2004 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1-3, 2004, : 1651 - 1654
  • [6] Audio-Visual Multilevel Fusion for Speech and Speaker Recognition
    Chetty, Girija
    Wagner, Michael
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 379 - 382
  • [7] Information Fusion Techniques in Audio-Visual Speech Recognition
    Karabalkan, H.
    Erdogan, H.
    2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 734 - 737
  • [8] Multi-Stage DNN Training for Automatic Recognition of Dysarthric Speech
    Yilmaz, Emre
    Ganzeboom, Mario
    Cucchiarini, Catia
    Strik, Helmer
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2685 - 2689
  • [9] Feature Extraction Using Pre-Trained Convolutive Bottleneck Nets for Dysarthric Speech Recognition
    Takashima, Yuki
    Nakashika, Toru
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 1411 - 1415
  • [10] Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
    Ahmad, Rehan
    Zubair, Syed
    Alquhayz, Hani
    Ditta, Allah
    SENSORS, 2019, 19 (23)