Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited: 0
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Orman, Giza 12613, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification (CLC): TP [automation technology, computer technology]
Discipline classification code: 0812
Abstract
Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been utilized to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model that uses a Siamese neural network to compare the pronounced-phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to that of the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first use of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
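The abstract describes a Siamese comparison between a pronounced-phoneme representation and a canonical phoneme embedding. As a rough, hedged illustration of that step only (not the authors' implementation: the hidden dimension, phoneme inventory size, mean-pooling, and cosine-to-score mapping are all assumptions for this sketch), the idea can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 768-d frame vectors (as in wav2vec 2.0 base) and a
# 39-phoneme inventory (ARPAbet-like). Both are illustrative choices.
HIDDEN_DIM, NUM_PHONEMES = 768, 39

# Canonical phoneme embeddings; learned in the real model, random here.
canonical_emb = rng.standard_normal((NUM_PHONEMES, HIDDEN_DIM))

def pool_phoneme_frames(frames: np.ndarray) -> np.ndarray:
    """Mean-pool the SSL frame vectors aligned to one pronounced phoneme."""
    return frames.mean(axis=0)

def siamese_score(pronounced: np.ndarray, phoneme_id: int) -> float:
    """Compare a pronounced-phoneme vector to its canonical embedding.

    Cosine similarity rescaled from [-1, 1] to a [0, 1] score; the actual
    model learns this comparison end to end rather than fixing it.
    """
    canonical = canonical_emb[phoneme_id]
    cos = pronounced @ canonical / (
        np.linalg.norm(pronounced) * np.linalg.norm(canonical)
    )
    return float((cos + 1.0) / 2.0)

# Toy usage: score one phoneme spanning 12 SSL frames.
frames = rng.standard_normal((12, HIDDEN_DIM))
score = siamese_score(pool_phoneme_frames(frames), phoneme_id=5)
assert 0.0 <= score <= 1.0
```

A pronunciation identical to the canonical embedding would score 1.0 under this mapping; in the paper the scoring head is trained against human phoneme-level ratings rather than a fixed similarity.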
Pages: 112650-112663
Page count: 14
Related papers
50 items in total
  • [31] Analysis of Pronunciation Learning in End-to-End Speech Synthesis
    Taylor, Jason
    Richmond, Korin
    INTERSPEECH 2019, 2019, : 2070 - 2074
  • [32] Self-supervised End-to-End ASR for Low Resource L2 Swedish
    Al-Ghezi, Ragheb
    Getman, Yaroslav
    Rouhe, Aku
    Hilden, Raili
    Kurimo, Mikko
    INTERSPEECH 2021, 2021, : 1429 - 1433
  • [33] SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
    Wang, Jianzong
    Zhang, Xulong
    Tang, Haobin
    Sun, Aolan
    Cheng, Ning
    Xiao, Jing
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [34] Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Yang, Zhenglu
    Zhou, Ming
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 9161 - 9168
  • [35] FINE-TUNING OF PRE-TRAINED END-TO-END SPEECH RECOGNITION WITH GENERATIVE ADVERSARIAL NETWORKS
    Haidar, Md Akmal
    Rezagholizadeh, Mehdi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6204 - 6208
  • [36] ON FINE-TUNING PRE-TRAINED SPEECH MODELS WITH EMA-TARGET SELF-SUPERVISED LOSS
    Yang, Hejung
    Kang, Hong-Goo
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6360 - 6364
  • [37] Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning
    Zhang, Yifan
    Hooi, Bryan
    Hu, Dapeng
    Liang, Jian
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [38] Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning
    Chen, Tianlong
    Liu, Sijia
    Chang, Shiyu
    Cheng, Yu
    Amini, Lisa
    Wang, Zhangyang
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 696 - 705
  • [39] A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models
    Li, Yan
    Wang, Yapeng
    Hoi, Lap Man
    Yang, Dingcheng
    Im, Sio-Kei
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2025, 2025 (01):
  • [40] Data-Driven End-to-End Optimization of Radio Over Fiber Transmission System Based on Self-Supervised Learning
    Zhu, Yue
    Ye, Jia
    Yan, Lianshan
    Zhou, Tao
    Yu, Xiao
    Zou, Xihua
    Pan, Wei
    JOURNAL OF LIGHTWAVE TECHNOLOGY, 2024, 42 (21) : 7532 - 7543