Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited: 0
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Orman, Giza 12613, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification (CLC): TP [automation technology, computer technology]
Discipline classification code: 0812
Abstract
Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been utilized to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model that uses a Siamese neural network to compare the pronounced-phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to that of the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first use of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
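The abstract describes a Siamese comparison between a pronounced-phoneme representation and a canonical phoneme embedding. As a rough, hedged illustration of that step only (not the authors' implementation: the hidden dimension, phoneme inventory size, mean-pooling, and cosine-to-score mapping are all assumptions for this sketch), the idea can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 768-d frame vectors (as in wav2vec 2.0 base) and a
# 39-phoneme inventory (ARPAbet-like). Both are illustrative choices.
HIDDEN_DIM, NUM_PHONEMES = 768, 39

# Canonical phoneme embeddings; learned in the real model, random here.
canonical_emb = rng.standard_normal((NUM_PHONEMES, HIDDEN_DIM))

def pool_phoneme_frames(frames: np.ndarray) -> np.ndarray:
    """Mean-pool the SSL frame vectors aligned to one pronounced phoneme."""
    return frames.mean(axis=0)

def siamese_score(pronounced: np.ndarray, phoneme_id: int) -> float:
    """Compare a pronounced-phoneme vector to its canonical embedding.

    Cosine similarity rescaled from [-1, 1] to a [0, 1] score; the actual
    model learns this comparison end to end rather than fixing it.
    """
    canonical = canonical_emb[phoneme_id]
    cos = pronounced @ canonical / (
        np.linalg.norm(pronounced) * np.linalg.norm(canonical)
    )
    return float((cos + 1.0) / 2.0)

# Toy usage: score one phoneme spanning 12 SSL frames.
frames = rng.standard_normal((12, HIDDEN_DIM))
score = siamese_score(pool_phoneme_frames(frames), phoneme_id=5)
assert 0.0 <= score <= 1.0
```

A pronunciation identical to the canonical embedding would score 1.0 under this mapping; in the paper the scoring head is trained against human phoneme-level ratings rather than a fixed similarity.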
Pages: 112650-112663
Page count: 14
Related papers
50 items in total
  • [31] Analysis of Pronunciation Learning in End-to-End Speech Synthesis
    Taylor, Jason
    Richmond, Korin
    INTERSPEECH 2019, 2019, : 2070 - 2074
  • [32] Self-supervised End-to-End ASR for Low Resource L2 Swedish
    Al-Ghezi, Ragheb
    Getman, Yaroslav
    Rouhe, Aku
    Hilden, Raili
    Kurimo, Mikko
    INTERSPEECH 2021, 2021, : 1429 - 1433
  • [33] SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
    Wang, Jianzong
    Zhang, Xulong
    Tang, Haobin
    Sun, Aolan
    Cheng, Ning
    Xiao, Jing
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [34] Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
    Wang, Chengyi
    Wu, Yu
    Liu, Shujie
    Yang, Zhenglu
    Zhou, Ming
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 9161 - 9168
  • [35] FINE-TUNING OF PRE-TRAINED END-TO-END SPEECH RECOGNITION WITH GENERATIVE ADVERSARIAL NETWORKS
    Haidar, Md Akmal
    Rezagholizadeh, Mehdi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6204 - 6208
  • [36] ON FINE-TUNING PRE-TRAINED SPEECH MODELS WITH EMA-TARGET SELF-SUPERVISED LOSS
    Yang, Hejung
    Kang, Hong-Goo
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6360 - 6364
  • [37] Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning
    Zhang, Yifan
    Hooi, Bryan
    Hu, Dapeng
    Liang, Jian
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [38] Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning
    Chen, Tianlong
    Liu, Sijia
    Chang, Shiyu
    Cheng, Yu
    Amini, Lisa
    Wang, Zhangyang
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 696 - 705
  • [39] A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models
    Li, Yan
    Wang, Yapeng
    Hoi, Lap Man
    Yang, Dingcheng
    Im, Sio-Kei
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2025, 2025 (01):
  • [40] Data-Driven End-to-End Optimization of Radio Over Fiber Transmission System Based on Self-Supervised Learning
    Zhu, Yue
    Ye, Jia
    Yan, Lianshan
    Zhou, Tao
    Yu, Xiao
    Zou, Xihua
    Pan, Wei
    JOURNAL OF LIGHTWAVE TECHNOLOGY, 2024, 42 (21) : 7532 - 7543