Transformer Based End-to-End Mispronunciation Detection and Diagnosis

Cited by: 15
Authors:
Wu, Minglin [1 ,3 ]
Li, Kun [2 ]
Leung, Wai-Kim [1 ,3 ]
Meng, Helen [1 ,3 ]
Affiliations:
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Human Comp Commun Lab, Hong Kong, Peoples R China
[2] SpeechX Ltd, Hong Kong, Peoples R China
[3] Ctr Perceptual & Interact Intelligence CPII Ltd, Hong Kong, Peoples R China
Source: INTERSPEECH 2021
Keywords:
Mispronunciation Detection and Diagnosis (MDD); Transformer; encoder-decoder; wav2vec 2.0; CNN feature encoder
DOI:
10.21437/Interspeech.2021-1467
CLC classification:
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes:
100104; 100213
Abstract:
This paper introduces two Transformer-based architectures for Mispronunciation Detection and Diagnosis (MDD). The first Transformer architecture (T-1) is a standard setup with an encoder, a decoder, a projection part and the Cross Entropy (CE) loss; T-1 takes Mel-Frequency Cepstral Coefficients (MFCCs) as input. The second architecture (T-2) is based on wav2vec 2.0, a pretraining framework. T-2 is composed of a CNN feature encoder, several Transformer blocks capturing contextual speech representations, a projection part and the Connectionist Temporal Classification (CTC) loss. Unlike T-1, T-2 takes raw audio as input. Both models are trained in an end-to-end manner. Experiments are conducted on the CU-CHLOE corpus, where T-1 achieves a Phone Error Rate (PER) of 8.69% and an F-measure of 77.23%, and T-2 achieves a PER of 5.97% and an F-measure of 80.98%. Both models significantly outperform the previously proposed AGPM and CNN-RNN-CTC models, whose PERs are 11.1% and 12.1%, and F-measures 72.61% and 74.65%, respectively.
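The abstract reports results in terms of Phone Error Rate and the MDD F-measure, and notes that T-2 is trained with a CTC loss. The sketch below illustrates these standard evaluation pieces: greedy (best-path) CTC decoding, PER via edit distance, and the detection F-measure computed from true-rejection/false-rejection/false-acceptance counts. This is a minimal illustration under common MDD conventions, not code from the paper; all function names are our own.

```python
# Illustrative sketch (assumptions, not the authors' code): greedy CTC
# decoding plus the PER and F-measure metrics used to score MDD systems.

def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: merge repeated labels, then drop blanks."""
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def mdd_f_measure(true_reject, false_reject, false_accept):
    """F-measure of mispronunciation detection, per the usual MDD convention:
    precision = TR / (TR + FR), recall = TR / (TR + FA)."""
    precision = true_reject / (true_reject + false_reject)
    recall = true_reject / (true_reject + false_accept)
    return 2 * precision * recall / (precision + recall)

# Example: with blank id 0, repeated frame labels collapse to one phone each.
print(ctc_greedy_decode([0, 3, 3, 0, 5, 5, 5, 0]))       # [3, 5]
# One substituted phone out of three reference phones -> PER of 1/3.
print(round(phone_error_rate(["s", "ih", "t"],
                             ["s", "ah", "t"]), 4))       # 0.3333
```

Note that with greedy CTC decoding, a genuinely repeated phone must be separated by a blank frame in the alignment; otherwise the repeats are merged.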
Pages: 3954-3958 (5 pages)
Related papers (50 in total):
  • [31] Tan, Xuhang; Tong, Jicheng; Matsumaru, Takafumi; Dutta, Vibekananda; He, Xin. An End-to-End Air Writing Recognition Method Based on Transformer. IEEE ACCESS, 2023, 11: 109885-109898.
  • [32] Radfar, Martin; Mouchtaris, Athanasios; Kunzmann, Siegfried. End-to-End Neural Transformer Based Spoken Language Understanding. INTERSPEECH 2020, 2020: 866-870.
  • [33] Liu, Chang; Wang, Gang; Zhang, Chen; Patimisco, Pietro; Cui, Ruyue; Feng, Chaofan; Sampaolo, Angelo; Spagnolo, Vincenzo; Dong, Lei; Wu, Hongpeng. End-to-end methane gas detection algorithm based on transformer and multi-layer perceptron. OPTICS EXPRESS, 2024, 32(01): 987-1002.
  • [34] Liang, Dingkang; Xu, Wei; Bai, Xiang. An End-to-End Transformer Model for Crowd Localization. COMPUTER VISION - ECCV 2022, PT I, 2022, 13661: 38-54.
  • [35] Tian, Yingjie; Bai, Kunlong. End-to-End Multitask Learning With Vision Transformer. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35(07): 9579-9590.
  • [36] Jeny, Afsana Ahsan; Junayed, Masum Shah; Islam, Md Baharul. An Efficient End-to-End Image Compression Transformer. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2022: 1786-1790.
  • [37] Appalaraju, Srikar; Jasani, Bhavan; Kota, Bhargava Urala; Xie, Yusheng; Manmatha, R. DocFormer: End-to-End Transformer for Document Understanding. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 973-983.
  • [38] Tang, Bo; Song, Zi-Kai; Sun, Wei; Wang, Xing-Dong. An end-to-end steel surface defect detection approach via Swin transformer. IET IMAGE PROCESSING, 2023, 17(05): 1334-1345.
  • [39] Rahali, Abir; Akhloufi, Moulay A. End-to-End Transformer-Based Models in Textual-Based NLP. AI, 2023, 4(01): 54-110.
  • [40] Chen, Long; Xu, Jinhua. Sequential Transformer for End-to-End Person Search. NEURAL INFORMATION PROCESSING, ICONIP 2023, PT IV, 2024, 14450: 226-238.