Transformer Based End-to-End Mispronunciation Detection and Diagnosis

Cited by: 15
Authors:
Wu, Minglin [1 ,3 ]
Li, Kun [2 ]
Leung, Wai-Kim [1 ,3 ]
Meng, Helen [1 ,3 ]
Affiliations:
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Human Comp Commun Lab, Hong Kong, Peoples R China
[2] SpeechX Ltd, Hong Kong, Peoples R China
[3] Ctr Perceptual & Interact Intelligence CPII Ltd, Hong Kong, Peoples R China
Source: INTERSPEECH 2021
Keywords:
Mispronunciation Detection and Diagnosis (MDD); Transformer; encoder-decoder; wav2vec 2.0; CNN feature encoder
DOI:
10.21437/Interspeech.2021-1467
CLC classification:
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes:
100104; 100213
Abstract:
This paper introduces two Transformer-based architectures for Mispronunciation Detection and Diagnosis (MDD). The first Transformer architecture (T-1) is a standard setup with an encoder, a decoder, a projection part and the Cross Entropy (CE) loss; T-1 takes Mel-Frequency Cepstral Coefficients (MFCCs) as input. The second architecture (T-2) is based on wav2vec 2.0, a pretraining framework. T-2 is composed of a CNN feature encoder, several Transformer blocks capturing contextual speech representations, a projection part and the Connectionist Temporal Classification (CTC) loss. Unlike T-1, T-2 takes raw audio as input. Both models are trained in an end-to-end manner. Experiments are conducted on the CU-CHLOE corpus, where T-1 achieves a Phone Error Rate (PER) of 8.69% and an F-measure of 77.23%, and T-2 achieves a PER of 5.97% and an F-measure of 80.98%. Both models significantly outperform the previously proposed AGPM and CNN-RNN-CTC models, whose PERs are 11.1% and 12.1%, and F-measures 72.61% and 74.65%, respectively.
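The abstract reports results in terms of Phone Error Rate and the MDD F-measure, and notes that T-2 is trained with a CTC loss. The sketch below illustrates these standard evaluation pieces: greedy (best-path) CTC decoding, PER via edit distance, and the detection F-measure computed from true-rejection/false-rejection/false-acceptance counts. This is a minimal illustration under common MDD conventions, not code from the paper; all function names are our own.

```python
# Illustrative sketch (assumptions, not the authors' code): greedy CTC
# decoding plus the PER and F-measure metrics used to score MDD systems.

def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: merge repeated labels, then drop blanks."""
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def mdd_f_measure(true_reject, false_reject, false_accept):
    """F-measure of mispronunciation detection, per the usual MDD convention:
    precision = TR / (TR + FR), recall = TR / (TR + FA)."""
    precision = true_reject / (true_reject + false_reject)
    recall = true_reject / (true_reject + false_accept)
    return 2 * precision * recall / (precision + recall)

# Example: with blank id 0, repeated frame labels collapse to one phone each.
print(ctc_greedy_decode([0, 3, 3, 0, 5, 5, 5, 0]))       # [3, 5]
# One substituted phone out of three reference phones -> PER of 1/3.
print(round(phone_error_rate(["s", "ih", "t"],
                             ["s", "ah", "t"]), 4))       # 0.3333
```

Note that with greedy CTC decoding, a genuinely repeated phone must be separated by a blank frame in the alignment; otherwise the repeats are merged.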
Pages: 3954-3958 (5 pages)
Related papers (50 in total):
  • [31] Tan, Xuhang; Tong, Jicheng; Matsumaru, Takafumi; Dutta, Vibekananda; He, Xin. An End-to-End Air Writing Recognition Method Based on Transformer. IEEE ACCESS, 2023, 11: 109885-109898.
  • [32] Radfar, Martin; Mouchtaris, Athanasios; Kunzmann, Siegfried. End-to-End Neural Transformer Based Spoken Language Understanding. INTERSPEECH 2020, 2020: 866-870.
  • [33] Liu, Chang; Wang, Gang; Zhang, Chen; Patimisco, Pietro; Cui, Ruyue; Feng, Chaofan; Sampaolo, Angelo; Spagnolo, Vincenzo; Dong, Lei; Wu, Hongpeng. End-to-end methane gas detection algorithm based on transformer and multi-layer perceptron. OPTICS EXPRESS, 2024, 32(01): 987-1002.
  • [34] Liang, Dingkang; Xu, Wei; Bai, Xiang. An End-to-End Transformer Model for Crowd Localization. COMPUTER VISION - ECCV 2022, PT I, 2022, 13661: 38-54.
  • [35] Tian, Yingjie; Bai, Kunlong. End-to-End Multitask Learning With Vision Transformer. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35(07): 9579-9590.
  • [36] Jeny, Afsana Ahsan; Junayed, Masum Shah; Islam, Md Baharul. An Efficient End-to-End Image Compression Transformer. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2022: 1786-1790.
  • [37] Appalaraju, Srikar; Jasani, Bhavan; Kota, Bhargava Urala; Xie, Yusheng; Manmatha, R. DocFormer: End-to-End Transformer for Document Understanding. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 973-983.
  • [38] Tang, Bo; Song, Zi-Kai; Sun, Wei; Wang, Xing-Dong. An end-to-end steel surface defect detection approach via Swin transformer. IET IMAGE PROCESSING, 2023, 17(05): 1334-1345.
  • [39] Rahali, Abir; Akhloufi, Moulay A. End-to-End Transformer-Based Models in Textual-Based NLP. AI, 2023, 4(01): 54-110.
  • [40] Chen, Long; Xu, Jinhua. Sequential Transformer for End-to-End Person Search. NEURAL INFORMATION PROCESSING, ICONIP 2023, PT IV, 2024, 14450: 226-238.