Transformer Based End-to-End Mispronunciation Detection and Diagnosis

Cited by: 15

Authors:

Wu, Minglin [1,3]

Li, Kun [2]

Leung, Wai-Kim [1,3]

Meng, Helen [1,3]

Affiliations:

[1] The Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management, Human-Computer Communications Laboratory, Hong Kong, China

[2] SpeechX Ltd, Hong Kong, China

[3] Centre for Perceptual and Interactive Intelligence (CPII) Ltd, Hong Kong, China

Source:

INTERSPEECH 2021 | 2021

Keywords:

Mispronunciation Detection and Diagnosis (MDD); Transformer; encoder-decoder; wav2vec 2.0; CNN feature encoder

DOI:

10.21437/Interspeech.2021-1467

Chinese Library Classification (CLC):

R36 [Pathology]; R76 [Otorhinolaryngology]

Subject classification codes:

100104; 100213

Abstract:

This paper introduces two Transformer-based architectures for Mispronunciation Detection and Diagnosis (MDD). The first Transformer architecture (T-1) is a standard setup with an encoder, a decoder, a projection part and the Cross Entropy (CE) loss. T-1 takes in Mel-Frequency Cepstral Coefficients (MFCC) as input. The second architecture (T-2) is based on wav2vec 2.0, a pretraining framework. T-2 is composed of a CNN feature encoder, several Transformer blocks capturing contextual speech representations, a projection part and the Connectionist Temporal Classification (CTC) loss. Unlike T-1, T-2 takes in raw audio data as input. Both models are trained in an end-to-end manner. Experiments are conducted on the CU-CHLOE corpus, where T-1 achieves a Phone Error Rate (PER) of 8.69% and an F-measure of 77.23%, and T-2 achieves a PER of 5.97% and an F-measure of 80.98%. Both models significantly outperform the previously proposed AGPM and CNN-RNN-CTC models, which have PERs of 11.1% and 12.1% respectively, and F-measures of 72.61% and 74.65% respectively.
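The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: PER as the Levenshtein edit distance between the reference and recognized phone sequences divided by the reference length, and F-measure as the harmonic mean of precision and recall. This record does not describe the paper's exact scoring procedure, so the function names and the example phone sequences below are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one substitution (dh -> d) and one deletion (t).
reference = ["dh", "ax", "k", "ae", "t"]
recognized = ["d", "ax", "k", "ae"]
per = phone_error_rate(reference, recognized)  # 2 errors / 5 phones = 0.4
```

With these definitions, a lower PER means the recognized phone sequence is closer to the annotated one, while a higher F-measure means a better balance between catching mispronunciations and avoiding false alarms.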


Pages: 3954-3958

Page count: 5

