E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

被引:10
|
作者
Zhang, Jicheng [1 ]
Peng, Yizhou [1 ]
Pham, Van Tung [2 ]
Xu, Haihua [2 ]
Huang, Hao [1 ]
Chng, Eng Siong [2 ]
机构
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
来源
基金
国家重点研发计划;
关键词
End-to-end; speech recognition; accent recognition; joint training; multi-task learning;
D O I
10.21437/Interspeech.2021-1495
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but can also yield comparable or even better results than standalone systems. Specifically, we found that the overall performance is predominantly determined by the ASR task, and the E2E-based ASR pretraining is essential to achieve improved performance, particularly for the AR task. Additionally, we conduct several analyses of the proposed method. First, though the objective loss for the AR task is much smaller compared with its counterpart of ASR task, a smaller weighting factor with the AR task in the joint objective function is necessary to yield better results for each task. Second, we found that sharing only a few layers of the encoder yields better AR results than sharing the overall encoder. Experimentally, the proposed method produces WER results close to the best standalone E2E ASR ones, while it achieves 7.7% and 4.2% relative improvement over standalone and single-task-based joint recognition methods on test set for accent recognition respectively.
引用
收藏
页码:1519 / 1523
页数:5
相关论文
共 50 条
  • [1] Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition
    Shao, Qijie
    Guo, Pengcheng
    Yan, Jinghao
    Hu, Pengfei
    Xie, Lei
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 459 - 470
  • [2] Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning
    Jain, Abhinav
    Upreti, Minali
    Jyothi, Preethi
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2454 - 2458
  • [3] Speech Emotion Recognition based on Multi-Task Learning
    Zhao, Huijuan
    Han Zhijie
    Wang, Ruchuan
    [J]. 2019 IEEE 5TH INTL CONFERENCE ON BIG DATA SECURITY ON CLOUD (BIGDATASECURITY) / IEEE INTL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING (HPSC) / IEEE INTL CONFERENCE ON INTELLIGENT DATA AND SECURITY (IDS), 2019, : 186 - 188
  • [4] Speech Emotion Recognition with Multi-task Learning
    Cai, Xingyu
    Yuan, Jiahong
    Zheng, Renjie
    Huang, Liang
    Church, Kenneth
    [J]. INTERSPEECH 2021, 2021, : 4508 - 4512
  • [5] Meta Multi-task Learning for Speech Emotion Recognition
    Cai, Ruichu
    Guo, Kaibin
    Xu, Boyan
    Yang, Xiaoyan
    Zhang, Zhenjie
    [J]. INTERSPEECH 2020, 2020, : 3336 - 3340
  • [6] MULTI-TASK JOINT-LEARNING OF DEEP NEURAL NETWORKS FOR ROBUST SPEECH RECOGNITION
    Qian, Yanmin
    Yin, Maofan
    You, Yongbin
    Yu, Kai
    [J]. 2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 310 - 316
  • [7] A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech
    Park, Byeongseon
    Yamamoto, Ryuichi
    Tachibana, Kentaro
    [J]. INTERSPEECH 2022, 2022, : 1931 - 1935
  • [8] Attention-based LSTM with Multi-task Learning for Distant Speech Recognition
    Zhang, Yu
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3857 - 3861
  • [9] Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning
    Zhao Huijuan
    Ye Ning
    Wang Ruchuan
    [J]. Journal of Signal Processing Systems, 2021, 93 : 299 - 308
  • [10] Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning
    Zhao, Huijuan
    Ye, Ning
    Wang, Ruchuan
    [J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2021, 93 (2-3): : 299 - 308