Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths

Cited by: 2

Authors:

Liu, Jiajun [1,2]

Wumaier, Aishan [2,3]

Wei, Dongping [2,3]

Guo, Shen [2,3]

Affiliations:

[1] Xinjiang Univ, Coll Software, Urumqi 830046, Peoples R China

[2] Key Lab Multilingual Informat Technol Xinjiang Uyg, Urumqi 830046, Peoples R China

[3] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China

Source:

APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 13

Keywords:

speech disfluency detection; stuttering; limited data; wav2vec2.0; entropy invariance; CLASSIFICATION; DYSFLUENCIES;

DOI:

10.3390/app13137579

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

Speech is critical for interpersonal communication, but not everyone speaks fluently. Speech disfluency, including stuttering and interruptions, affects both the emotional expression and the clarity of speech of people who stutter. Existing methods for detecting speech disfluency rely heavily on annotated data, which is costly to obtain. These methods also do not address variable-length disfluent speech, which limits their scalability. To address these limitations, this paper proposes an automated method for detecting speech disfluency. The method detects four types of disfluency using single-task detection; it uses embeddings from the pre-trained wav2vec2.0 model together with convolutional neural network (CNN) and Transformer models for feature extraction. To handle variable-length disfluent speech, the model is modified based on the entropy invariance of attention mechanisms, which improves its scalability. The proposed method has the potential to help individuals overcome speech disfluency and improve their communication skills, and to help therapists track the progress of stuttering patients. The model's scalability across languages and lengths also enhances its practical applicability. Experiments show that the model outperforms baseline models on both English and Chinese datasets, supporting its generality and scalability in real-world applications.
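The abstract does not say how the attention mechanism is modified, so the following is only an illustrative sketch of one commonly described form of entropy-invariant attention scaling, not the authors' implementation. It assumes the usual 1/sqrt(d) scale is multiplied by log(n)/log(train_len), where n is the number of keys and train_len is a reference sequence length; the function names and the default train_len=512 are hypothetical.

```python
import math


def attention_scale(n, d, train_len=512):
    """Length-aware logit scale (assumed form): (log n / log train_len) / sqrt(d).

    When n equals train_len, this reduces to the standard 1/sqrt(d) scale.
    """
    return (math.log(n) / math.log(train_len)) / math.sqrt(d)


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_invariant_attention(q, k, v, train_len=512):
    """Dot-product attention with the length-aware scale above.

    q, k, v are lists of vectors (lists of floats); k and v have equal length.
    """
    n = len(k)
    d = len(k[0])
    scale = attention_scale(n, d, train_len)
    out = []
    for qi in q:
        # Scaled dot product of the query with every key.
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(logits)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out
```

Under this assumed form, longer inputs receive a larger scale, which sharpens the softmax and counteracts the tendency of attention to flatten as the number of keys grows.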

Pages: 25
