Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection

Cited by: 1

Authors:

Wadud, Md. Anwar Hussen ^{[1]}

Alatiyyah, Mohammed ^{[2]}

Mridha, M. F. ^{[3]}

Affiliations:

[1] Bangladesh Univ Business & Technol, Dept Comp Sci & Engn, Dhaka 1216, Bangladesh

[2] Prince Sattam Bin Abdulaziz Univ, Coll Sci & Humanities Aflaj, Dept Comp Sci, Al Kharj 16278, Saudi Arabia

[3] Amer Int Univ Bangladesh, Dept Comp Sci & Engn, Dhaka 1216, Bangladesh

Source:

APPLIED SCIENCES-BASEL | 2023 / Volume 13 / Issue 01

Keywords:

non-autoregressive; pronunciation modeling; speech recognition; mispronunciation detection and diagnosis; attention; computer-assisted pronunciation training (CAPT);

DOI:

10.3390/app13010109

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

The mispronunciation detection and diagnosis (MDD) technique is a crucial element of computer-assisted pronunciation training (CAPT) systems. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. Conventional approaches, such as forced alignment and extended recognition networks, have made full use of these prior texts for model development or for enhancing system performance. End-to-end (E2E) approaches have recently attempted to incorporate the prior texts into model training, and preliminary results indicate that this is effective. However, attention-based E2E models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, E2E neural approaches are typically data-hungry, and a lack of non-native training data frequently impairs their effectiveness in MDD. To address these problems, we propose a novel MDD technique that uses non-autoregressive (NAR) E2E neural models to greatly reduce estimation time while maintaining accuracy similar to that of conventional E2E neural models. Unlike autoregressive models, NAR models accept parallel inputs and generate token sequences in parallel rather than through left-to-right forward computation. To further enhance the effectiveness of MDD, we design and build a pronunciation model on top of the NAR E2E models. We evaluate our approach against some of the best E2E models on the publicly accessible L2-ARCTIC and SpeechOcean English datasets, which are used for both training and testing, and the proposed model achieves better results than the other existing models.
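
The parallel decoding idea in the abstract can be illustrated with a small sketch. This is not the paper's model: it assumes a CTC-style decoder in which each frame's label is chosen independently of all other frames (so the selection step could run in parallel), followed by a simple alignment against the canonical phone sequence to flag mispronunciations. The blank symbol, the toy posteriors, and the function names `nar_decode` and `detect_mispronunciations` are all hypothetical.

```python
from difflib import SequenceMatcher

BLANK = "<b>"  # hypothetical blank symbol for CTC-style collapsing


def nar_decode(frame_posteriors):
    """Pick the best label for every frame independently, then collapse.

    Each frame is a dict {phone: probability}. No frame's choice depends on
    another frame's output, so the argmax step could be run in parallel.
    """
    best = [max(frame, key=frame.get) for frame in frame_posteriors]
    phones = []
    prev = None
    for label in best:
        # CTC-style rule: merge consecutive repeats, then drop blanks.
        if label != prev and label != BLANK:
            phones.append(label)
        prev = label
    return phones


def detect_mispronunciations(canonical, recognized):
    """Align canonical and recognized phones; report the non-matching spans."""
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((tag, canonical[i1:i2], recognized[j1:j2]))
    return errors


# Toy posteriors for the word "think" (canonical phones: TH IH NG K),
# where the learner produces "S" in place of "TH". Values are made up.
frames = [
    {"S": 0.7, "TH": 0.2, BLANK: 0.1},
    {"S": 0.6, "TH": 0.1, BLANK: 0.3},
    {BLANK: 0.8, "IH": 0.2},
    {"IH": 0.9, BLANK: 0.1},
    {"NG": 0.8, BLANK: 0.2},
    {"K": 0.6, BLANK: 0.4},
]
recognized = nar_decode(frames)
errors = detect_mispronunciations(["TH", "IH", "NG", "K"], recognized)
print(recognized)
print(errors)
```

An autoregressive decoder would instead condition each output token on the tokens already emitted, which forces left-to-right computation; the sketch only shows why dropping that dependency allows all positions to be decoded at once.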

Pages: 18
