Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

Cited by: 6

Authors:

Yu, Chongchong [1]

Su, Xiaosu [1]

Qian, Zhaopeng [1]

Affiliations:

[1] Beijing Technol & Business Univ, Sch Artificial Intelligence, Beijing 100048, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING | 2023 / Vol. 31

Keywords:

Dysarthric speech recognition; pre-training and fine-tuning; multi-stage audio-visual fusion; SPEAKER ADAPTATION; PATTERNS; LIPS;

DOI:

10.1109/TNSRE.2023.3262001

Chinese Library Classification (CLC) number:

R318 [Biomedical Engineering];

Subject classification code:

0831;

Abstract:

Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained sufficiently on it. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode motor information from all facial speech function areas. This differs from the traditional audio-visual fusion approach, which relies solely on lip movement. In the second stage, we use the AV-HuBERT framework to pre-train a recognition architecture that fuses the audio and visual information of dysarthric speech. The knowledge gained by the pre-trained model is applied to address overfitting. Experiments on UASpeech are designed to evaluate the proposed method. Compared with the baseline method, the best word error rate (WER) of the proposed method is reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, the proposed method achieves its best result, a WER of 6.05%. Even for extremely severe dysarthric speech, the proposed method reaches a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. These results indicate that the proposed method can further reduce the WER of dysarthric speech recognition.
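All results in the abstract are reported as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this record does not state which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Illustrative sketch only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. A value such as 6.05% corresponds to roughly six word errors per hundred reference words.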

Pages: 1912-1921

Page count: 10
