Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Cited by: 7
Authors
Geng, Mengzhe [1 ]
Xie, Xurong [2 ]
Ye, Zi [1 ]
Wang, Tianzi [1 ]
Li, Guinan [1 ]
Hu, Shujie [1 ]
Liu, Xunying [1 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong 999077, Peoples R China
[2] Chinese Acad Sci, Inst Software, Beijing 100045, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech recognition; Older adults; Adaptation models; Speech; Acoustics; Training; Task analysis; Speaker adaptation; disordered speech recognition; elderly speech recognition; TRANSFORMATIONS; NORMALIZATION; HMM;
DOI
10.1109/TASLP.2022.3195113
CLC Classification Number
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech in recent decades, accurate recognition of dysarthric and elderly speech remains a highly challenging task to date. Sources of heterogeneity commonly found in normal speech, such as accent and gender, when further compounded with variability over age and speech pathology severity, create large diversity among speakers. To this end, speaker adaptation techniques play a key role in the personalization of ASR systems for such users. Motivated by the spectro-temporal differences between dysarthric, elderly and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis deep embedding features derived using SVD decomposition of the speech spectrum are proposed in this paper to facilitate auxiliary-feature-based speaker adaptation of state-of-the-art hybrid DNN/TDNN and end-to-end Conformer speech recognition systems. Experiments were conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed spectro-temporal deep feature adapted systems outperformed baseline i-Vector and x-Vector adaptation by up to 2.63% absolute (8.63% relative) reduction in word error rate (WER). Consistent performance improvements were retained after model-based speaker adaptation using learning hidden unit contributions (LHUC) was further applied. The best speaker-adapted system using the proposed spectral basis embedding features produced the lowest published WER of 25.05% on the UASpeech test set of 16 dysarthric speakers.
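To make the abstract's central idea concrete, the sketch below illustrates, under stated assumptions, how spectro-temporal subspace bases could be obtained from an SVD decomposition of a log-mel speech spectrogram and used as utterance-level auxiliary features. This is not the authors' released code; the function name spectral_basis_features, the parameters n_mels and n_bases, and the use of librosa for feature extraction are illustrative choices made here.

```python
# Minimal sketch: SVD decomposition of a log-mel spectrogram into spectral and
# temporal subspace bases (assumption-based illustration, not the paper's code).
import numpy as np
import librosa  # assumed available for mel spectrogram computation


def spectral_basis_features(wav, sr=16000, n_mels=80, n_bases=2):
    """Return top-`n_bases` spectral and temporal bases of a log-mel spectrogram.

    Left singular vectors span the spectral subspace (e.g. articulatory
    imprecision, reduced clarity); right singular vectors span the temporal
    subspace (e.g. speaking rate, dysfluencies).
    """
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)                     # shape: (n_mels, n_frames)
    U, s, Vt = np.linalg.svd(log_mel, full_matrices=False)
    spectral_basis = U[:, :n_bases]                   # (n_mels, n_bases)
    temporal_basis = Vt[:n_bases, :]                  # (n_bases, n_frames)
    # In the paper the bases feed a DNN embedding extractor; here we simply
    # flatten the spectral basis into a fixed-length auxiliary vector that
    # could be concatenated with frame-level acoustic features.
    return spectral_basis.flatten(), temporal_basis
```

The abstract also reports further gains from model-based adaptation with learning hidden unit contributions (LHUC). The fragment below shows the standard LHUC formulation (speaker-dependent parameters r scaling hidden activations with amplitude 2*sigmoid(r)); it is an assumption of the common recipe, not the paper's exact implementation.

```python
def lhuc_scale(hidden, r):
    """Element-wise LHUC scaling of hidden activations; only r is adapted per speaker."""
    return hidden * (2.0 / (1.0 + np.exp(-r)))        # scaling amplitude in (0, 2)
```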
Pages: 2597-2611
Page count: 15