Transformer Based End-to-End Mispronunciation Detection and Diagnosis

Cited by: 15
Authors
Wu, Minglin [1, 3]
Li, Kun [2]
Leung, Wai-Kim [1, 3]
Meng, Helen [1, 3]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Human Comp Commun Lab, Hong Kong, Peoples R China
[2] SpeechX Ltd, Hong Kong, Peoples R China
[3] Ctr Perceptual & Interact Intelligence CPII Ltd, Hong Kong, Peoples R China
Source
Keywords
Mispronunciation Detection and Diagnosis (MDD); Transformer; encoder-decoder; wav2vec 2.0; CNN feature encoder
DOI
10.21437/Interspeech.2021-1467
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper introduces two Transformer-based architectures for Mispronunciation Detection and Diagnosis (MDD). The first architecture (T-1) is a standard setup with an encoder, a decoder, a projection layer and the Cross Entropy (CE) loss. T-1 takes Mel-Frequency Cepstral Coefficients (MFCC) as input. The second architecture (T-2) is based on wav2vec 2.0, a pretraining framework. T-2 is composed of a CNN feature encoder, several Transformer blocks that capture contextual speech representations, a projection layer and the Connectionist Temporal Classification (CTC) loss. Unlike T-1, T-2 takes raw audio as input. Both models are trained end-to-end. Experiments are conducted on the CU-CHLOE corpus, where T-1 achieves a Phone Error Rate (PER) of 8.69% and an F-measure of 77.23%, and T-2 achieves a PER of 5.97% and an F-measure of 80.98%. Both models significantly outperform the previously proposed AGPM and CNN-RNN-CTC models, which reach PERs of 11.1% and 12.1% and F-measures of 72.61% and 74.65%, respectively.
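The abstract reports results as Phone Error Rate and F-measure. As a rough illustration of how such figures are commonly computed, the sketch below derives PER from the Levenshtein (edit) distance between a reference and a recognized phone sequence, and the F-measure as the harmonic mean of precision and recall. The function names and the example phone sequences are hypothetical. The record does not describe the paper's actual scoring procedure, including how precision and recall are defined over mispronunciation decisions, so this should be read as an assumption-laden sketch rather than the authors' evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: one substitution (DH -> D) and one deletion (T).
    reference = ["DH", "AH", "K", "AE", "T"]
    recognized = ["D", "AH", "K", "AE"]
    print(phone_error_rate(reference, recognized))  # 0.4
```

Corpus-level PER is usually obtained by summing edit distances and reference lengths over all utterances before dividing, not by averaging per-utterance rates.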
Pages: 3954-3958
Number of pages: 5