Automated Speech Audiometry: Can It Work Using Open-Source Pre-Trained Kaldi-NL Automatic Speech Recognition?

Times Cited: 0
Authors
Araiza-Illan, Gloria [1 ,2 ]
Meyer, Luke [1 ,2 ]
Truong, Khiet P. [3 ]
Baskent, Deniz [1 ,2 ]
Affiliations
[1] Univ Groningen, Univ Med Ctr Groningen, Dept Otorhinolaryngol Head & Neck Surg, Groningen, Netherlands
[2] Univ Groningen, Univ Med Ctr Groningen, WJ Kolff Inst Biomed Engn & Mat Sci, Groningen, Netherlands
[3] Univ Twente, Human Media Interact, Enschede, Netherlands
Source
TRENDS IN HEARING | 2024, Vol. 28
Keywords
speech audiometry; speech perception; automatic speech recognition; speech-in-noise hearing test; digits-in-noise test; NOISE; INTELLIGIBILITY; THRESHOLD; LISTENERS; HEARING;
DOI
10.1177/23312165241229057
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
A practical speech audiometry tool is the digits-in-noise (DIN) test for hearing screening of populations of varying ages and hearing status. The test is usually conducted by a human supervisor (e.g., a clinician), who scores the responses spoken by the listener, or online, where software scores the responses entered by the listener. The test presents 24 digit triplets in an adaptive staircase procedure, resulting in a speech reception threshold (SRT). We propose an alternative automated DIN test setup that can evaluate spoken responses without a human supervisor, using the open-source automatic speech recognition toolkit Kaldi-NL. Thirty self-reported normal-hearing Dutch adults (19-64 years) each completed one DIN + Kaldi-NL test. Their spoken responses were recorded and used to evaluate the transcript of responses decoded by Kaldi-NL. Study 1 evaluated Kaldi-NL performance through its word error rate (WER): the percentage of summed digit decoding errors in the transcript relative to the total number of digits present in the spoken responses. The average WER across participants was 5.0% (range 0-48%, SD = 8.8%), with decoding errors occurring in an average of three triplets per participant. Study 2 analyzed the effect that triplets with decoding errors from Kaldi-NL had on the DIN test output (SRT), using bootstrapping simulations. Previous research indicated 0.70 dB as the typical within-subject SRT variability for normal-hearing adults. Study 2 showed that up to four triplets with decoding errors produce SRT variations within this range, suggesting that our proposed setup could be feasible for clinical applications.
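The adaptive staircase that yields the SRT can be illustrated with a small sketch. Everything below is an assumption for illustration only: the 2 dB step size, the idealized deterministic listener, and the convention of averaging the SNRs of triplets 5-24 plus a virtual 25th presentation are common DIN-test choices, not details taken from this paper.

```python
def run_din_staircase(srt_true, n_triplets=24, start_snr=0.0, step=2.0):
    """Sketch of a 1-up/1-down DIN adaptive staircase (illustrative only).

    Assumptions (not from the paper): a fixed 2 dB step, an idealized
    listener who repeats the triplet correctly whenever SNR >= srt_true,
    and an SRT estimate that averages the SNRs of triplets 5..24 plus
    the SNR that would have been presented next.
    """
    snr = start_snr
    snrs = []
    for _ in range(n_triplets):
        snrs.append(snr)
        correct = snr >= srt_true          # idealized listener model
        snr += -step if correct else step  # harder after correct, easier after error
    snrs.append(snr)                       # virtual 25th presentation
    return sum(snrs[4:]) / len(snrs[4:])   # skip the first 4 approach triplets
```

With this deterministic listener the track converges and oscillates around `srt_true`, so the estimate lands within one step of it; a real listener's responses are probabilistic, which is why the bootstrapping analysis of Study 2 is needed to quantify how decoding errors propagate into the SRT.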
Pages: 13