Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Cited by: 5
Authors
Geng, Mengzhe [1]
Liu, Shansong [1]
Yu, Jianwei [1]
Xie, Xurong [2]
Hu, Shoukang [1]
Ye, Zi [1]
Jin, Zengrui [1]
Liu, Xunying [1]
Meng, Helen [1]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing, Peoples R China
Keywords
Speech Disorders; Speech Recognition; Speaker Adaptation; Speech Assessment; Subspace-based Learning; Dysarthria
DOI
10.21437/Interspeech.2021-60
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.
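The SVD step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the input is a (frequency bins x frames) spectrogram matrix and that the leading left and right singular vectors serve as spectral and temporal bases respectively. The function name `spectro_temporal_bases`, the parameter `num_components`, and the random stand-in data are hypothetical; the abstract does not specify the paper's actual feature pipeline or how the bases are turned into deep embedding features.

```python
import numpy as np


def spectro_temporal_bases(spectrogram, num_components=2):
    """Split a (freq_bins x frames) matrix into spectral and temporal bases.

    Illustrative only: the name and the choice of keeping the top
    `num_components` singular vectors are assumptions, not the paper's recipe.
    """
    # Thin SVD: spectrogram = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = u[:, :num_components]      # shape (freq_bins, k)
    temporal_basis = vt[:num_components, :].T   # shape (frames, k)
    return spectral_basis, s[:num_components], temporal_basis


# Stand-in data: 40 frequency bins by 120 frames of random values,
# used here in place of a real utterance spectrogram.
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 120))
spectral, singular_values, temporal = spectro_temporal_bases(spec, num_components=2)
```

In this sketch the columns of `spectral` span the dominant frequency-axis subspace and the columns of `temporal` span the dominant time-axis subspace of the utterance. Any further processing of these bases into speaker-level features is outside what the abstract describes.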
Pages: 4793-4797
Page count: 5