Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths

Cited by: 2
Authors
Liu, Jiajun [1 ,2 ]
Wumaier, Aishan [2 ,3 ]
Wei, Dongping [2 ,3 ]
Guo, Shen [2 ,3 ]
Affiliations
[1] Xinjiang Univ, Coll Software, Urumqi 830046, Peoples R China
[2] Key Lab Multilingual Informat Technol Xinjiang Uyg, Urumqi 830046, Peoples R China
[3] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 13
Keywords
speech disfluency detection; stuttering; limited data; wav2vec2.0; entropy invariance; CLASSIFICATION; DYSFLUENCIES;
DOI
10.3390/app13137579
Chinese Library Classification (CLC)
O6 [Chemistry];
Discipline Code
0703;
Abstract
Speech is critical for interpersonal communication, but not everyone has fluent communication skills. Speech disfluency, including stuttering and interruptions, affects not only emotional expression but also clarity of expression for people who stutter. Existing methods for detecting speech disfluency rely heavily on annotated data, which is costly to obtain, and they do not account for disfluent speech of variable length, which limits their scalability. To address these limitations, this paper proposes an automated speech disfluency detection method that can help individuals improve their communication skills and assist therapists in tracking the progress of stuttering patients. The method detects four types of disfluency with single-task detection, using embeddings from the pre-trained wav2vec2.0 model together with convolutional neural network (CNN) and Transformer models for feature extraction. Scalability is improved by handling variable-length disfluent speech through a modification of the model based on the entropy invariance of attention mechanisms, which also makes the model applicable across languages and utterance lengths. Experiments demonstrate that the model outperforms baseline models on both English and Chinese datasets, showing its universality and scalability in real-world applications.
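To make the architecture described in the abstract concrete, the following is a minimal sketch of one single-task detector: frozen wav2vec2.0 frame embeddings, a 1-D CNN, and a length-compensated attention pooling whose logit scale grows with log(n), in the spirit of the entropy-invariance modification. The checkpoint name, layer sizes, reference training length (512), and binary present/absent output are illustrative assumptions, not details taken from the paper.

```python
import math
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


def entropy_invariant_scale(seq_len: int, d_model: int, train_len: int = 512) -> float:
    """Attention logit scale with an extra log(n)/log(n_train) factor.

    Standard attention divides q.k by sqrt(d); the entropy-invariance argument
    adds a log-length factor so the attention distribution keeps a comparable
    entropy when the utterance length differs from the length seen in training.
    train_len = 512 is an illustrative reference value, not the paper's setting.
    """
    return math.log(seq_len) / (math.log(train_len) * math.sqrt(d_model))


class DisfluencyDetector(nn.Module):
    """Single-task detector for one disfluency type (e.g., prolongation)."""

    def __init__(self, d_model: int = 768, num_classes: int = 2):
        super().__init__()
        # Frozen pre-trained wav2vec2.0 encoder supplies frame-level embeddings.
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        for p in self.wav2vec.parameters():
            p.requires_grad = False
        # 1-D CNN refines local acoustic patterns.
        self.cnn = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # Learned query for length-aware attention pooling over frames.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio, any length.
        feats = self.wav2vec(waveform).last_hidden_state          # (B, T, D)
        feats = self.cnn(feats.transpose(1, 2)).transpose(1, 2)   # (B, T, D)
        n, d = feats.size(1), feats.size(2)
        q = self.query.expand(feats.size(0), -1, -1)              # (B, 1, D)
        # Length-compensated attention pooling over all frames.
        logits = (q @ feats.transpose(1, 2)) * entropy_invariant_scale(n, d)
        pooled = torch.softmax(logits, dim=-1) @ feats            # (B, 1, D)
        return self.classifier(pooled.squeeze(1))                 # (B, num_classes)


if __name__ == "__main__":
    model = DisfluencyDetector()
    audio = torch.randn(2, 32000)    # two 2-second clips at 16 kHz
    print(model(audio).shape)        # torch.Size([2, 2])
```

Because the scale is recomputed from the actual frame count at inference time, the same weights can be applied to utterances of different lengths without retuning the attention temperature, which is the practical point of the entropy-invariance adjustment.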
Pages: 25
Related Papers
50 records in total
  • [21] WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition
    Li, Feng
    Luo, Jiusong
    Xia, Wanjun
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 325 - 336
  • [22] Comparison of wav2vec 2.0 models on three speech processing tasks
    Kunešová, Marie
    Zajíc, Zbyněk
    Šmídl, Luboš
    Karafiát, Martin
    International Journal of Speech Technology, 2024, 27 (04) : 847 - 859
  • [23] A Preliminary Study on Wav2Vec 2.0 Embeddings for Text-to-Speech
    Lim, Yohan
    Kim, Namhyeong
    Yun, Seung
    Kim, Hun
    Lee, Seung-Ik
    12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, : 343 - 347
  • [24] Wav2f0: Exploring the Potential of Wav2vec 2.0 for Speech Fundamental Frequency Extraction
    Feng, Rui
    Liu, Yin-Long
    Ling, Zhen-Hua
    Yuan, Jia-Hong
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 169 - 173
  • [25] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
    Baevski, Alexei
    Zhou, Henry
    Mohamed, Abdelrahman
    Auli, Michael
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [26] Conversational Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotion Information Compensation
    曹荣贺
    吴晓龙
    冯畅
    郑方
    徐明星
    哈妮克孜伊拉洪
    艾斯卡尔艾木都拉
    Journal of Signal Processing (信号处理), 2023, (04) : 698 - 707
  • [27] Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0
    Bayerl, Sebastian P.
    Wagner, Dominik
    Noeth, Elmar
    Riedhammer, Korbinian
    INTERSPEECH 2022, 2022, : 2868 - 2872
  • [28] Siamese Network with Wav2vec Feature for Spoofing Speech Detection
    Xie, Yang
    Zhang, Zhenchuan
    Yang, Yingchun
    INTERSPEECH 2021, 2021, : 4269 - 4273
  • [29] Unsupervised Spoken Term Discovery Using wav2vec 2.0
    Iwamoto, Yu
    Shinozaki, Takahiro
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 1082 - 1086
  • [30] MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0
    Sharma, Mayank
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6907 - 6911