Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models

Cited by: 6

Authors:

Yu, Chongchong ^{[1]}

Su, Xiaosu ^{[1]}

Qian, Zhaopeng ^{[1]}

Affiliations:

[1] Beijing Technology and Business University, School of Artificial Intelligence, Beijing 100048, People's Republic of China

Source:

IEEE Transactions on Neural Systems and Rehabilitation Engineering | 2023 / Vol. 31

Keywords:

Dysarthric speech recognition; pre-training and fine-tuning; multi-stage audio-visual fusion; SPEAKER ADAPTATION; PATTERNS; LIPS;

DOI:

10.1109/TNSRE.2023.3262001

Chinese Library Classification (CLC) number:

R318 [Biomedical Engineering];

Discipline classification code:

0831;

Abstract:

Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, dysarthric speech is difficult to collect, so machine learning models cannot be trained sufficiently on it. To further improve the accuracy of dysarthric speech recognition, we propose a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network to encode motor information from all facial speech function areas. This differs from the traditional audio-visual fusion approach, which relies solely on lip movement. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech. The knowledge gained by the pre-trained model is then used to address overfitting. Experiments on UASpeech evaluate the proposed method. Compared with the baseline method, the best word error rate (WER) of the proposed method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, the proposed method gives its best result, a WER of 6.05%. Even for extremely severe dysarthric speech, the proposed method reaches a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can therefore effectively reduce the WER of dysarthric speech.
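
The abstract reports every result as a word error rate (WER). As a reminder of what that metric measures, the following is a minimal sketch assuming the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' evaluation code, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count.

    Assumption: whitespace tokenization and exact string match per word.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, a hypothesis longer than the reference can push the WER above 100%.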

Pages: 1912 - 1921

Page count: 10

50 items in total

[11] Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Duan, Haoyi
Xia, Yan
Zhou, Mingze
Tang, Li
Zhu, Jieming
Zhao, Zhou
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[12] DBN based multi-stream models for audio-visual speech recognition
Gowdy, JN
Subramanya, A
Bartels, C
Bilmes, J
2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004 : 993 - 996
[13] Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
Liu, Hong
Li, Wenhao
Yang, Bing
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021 : 7580 - 7586
[14] MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators
Tan, Zhixing
Zhang, Xiangwen
Wang, Shuo
Liu, Yang
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022 : 6131 - 6142
[15] MULTI-SCALE HYBRID FUSION NETWORK FOR MANDARIN AUDIO-VISUAL SPEECH RECOGNITION
Wang, Jinxin
Guo, Zhongwen
Yang, Chao
Li, Xiaomei
Cui, Ziyuan
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023 : 642 - 647
[16] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
Hwang, Jung-Wook
Park, Jeongkyun
Park, Rae-Hong
Park, Hyung-Min
APPLIED ACOUSTICS, 2023, 211
[17] An audio-visual speech recognition with a new mandarin audio-visual database
Liao, Wen-Yuan
Pao, Tsang-Long
Chen, Yu-Te
Chang, Tsun-Wei
INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007 : 19 - +
[18] Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition
Chen, Hang
Wang, Qing
Du, Jun
Yin, Bao-Cai
Pan, Jia
Lee, Chin-Hui
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2508 - 2521
[19] DBN based models for audio-visual speech analysis and recognition
Ravyse, Ilse
Jiang, Dongmei
Jiang, Xiaoyue
Lv, Guoyun
Hou, Yunshu
Sahli, Hichem
Zhao, Rongchun
ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2006, PROCEEDINGS, 2006, 4261 : 19 - 30
[20] Deep Audio-Visual Speech Recognition
Afouras, Triantafyllos
Chung, Joon Son
Senior, Andrew
Vinyals, Oriol
Zisserman, Andrew
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 8717 - 8727
