Silent versus modal multi-speaker speech recognition from ultrasound and video

Times Cited: 4
Authors
Ribeiro, Manuel Sam [1 ,2 ]
Eshky, Aciel [1 ,3 ]
Richmond, Korin [1 ]
Renals, Steve [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Amazon, Seattle, WA 98109 USA
[3] Rasa Technol, Dhaka, Bangladesh
Source
INTERSPEECH 2021
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
silent speech interfaces; silent speech; ultrasound tongue imaging; video lip imaging; articulatory speech recognition; COMMUNICATION;
DOI
10.21437/Interspeech.2021-23
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Classification Codes
100104; 100213;
Abstract
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
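The articulatory-space estimate described in the abstract can be illustrated with a minimal sketch: pool the (x, y) points of all tongue splines extracted from an utterance and take the area of their convex hull. The function name and synthetic data below are illustrative assumptions, not the authors' implementation, which operates on splines extracted from ultrasound tongue images.

```python
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(spline_points):
    """Estimate the articulatory space covered by an utterance as the area
    of the convex hull of all tongue-spline points pooled over frames.

    spline_points: array of shape (N, 2) holding (x, y) coordinates
    in whatever units the splines are expressed in (e.g. pixels or mm).
    """
    points = np.asarray(spline_points, dtype=float)
    hull = ConvexHull(points)
    # For 2-D input, ConvexHull.volume is the enclosed area;
    # ConvexHull.area would be the hull perimeter.
    return hull.volume

# Synthetic example: 100 frames with 40 spline points each (stand-in data).
rng = np.random.default_rng(0)
splines = rng.normal(size=(100, 40, 2))
area = articulatory_space_area(splines.reshape(-1, 2))
print(f"Convex hull area of the pooled tongue splines: {area:.2f}")
```

Comparing this area between silent and modal renditions of the same material would quantify the reported finding that silent speech covers a smaller articulatory space.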
Pages: 641-645
Number of pages: 5
Related Papers
50 records in total
  • [21] Multi-speaker Recognition in Cocktail Party Problem
    Wang, Yiqian
    Sun, Wensheng
    [J]. COMMUNICATIONS, SIGNAL PROCESSING, AND SYSTEMS, 2019, 463 : 2116 - 2123
  • [22] Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data
    Huang, Wen-Chin
    Wu, Yi-Chiao
    Toda, Tomoki
    [J]. IEEE Signal Processing Letters, 2024, 31 : 2995 - 2999
  • [23] Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
    Guo, Minghao
    Wei, Jianguo
    Zhang, Ruiteng
    Zhao, Yu
    Fang, Qiang
    [J]. Speech Communication, 2024, 165
  • [24] TOWARDS MULTI-SPEAKER UNSUPERVISED SPEECH PATTERN DISCOVERY
    Zhang, Yaodong
    Glass, James R.
    [J]. 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4366 - 4369
  • [25] Real-time End-to-End Monaural Multi-speaker Speech Recognition
    Li, Song
    Ouyang, Beibei
    Tong, Fuchuan
    Liao, Dexin
    Li, Lin
    Hong, Qingyang
    [J]. INTERSPEECH 2021, 2021, : 3750 - 3754
  • [26] Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding
    Kim, Minsoo
    Jang, Gil-Jin
    [J]. Applied Sciences (Switzerland), 2024, 14 (18)
  • [27] MULTI-SPEAKER, NARROWBAND, CONTINUOUS MARATHI SPEECH DATABASE
    Godambe, Tejas
    Bondale, Nandini
    Samudravijaya, K.
    Rao, Preeti
    [J]. 2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE), 2013,
  • [28] An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets
    Gallegos, Pilar Oplustil
    Williams, Jennifer
    Rownicka, Joanna
    King, Simon
    [J]. INTERSPEECH 2020, 2020, : 1758 - 1762
  • [29] MULTI-SPEAKER CONVERSATIONS, CROSS-TALK, AND DIARIZATION FOR SPEAKER RECOGNITION
    Sell, Gregory
    McCree, Alan
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5425 - 5429
  • [30] Multi-speaker Emotional Text-to-speech Synthesizer
    Cho, Sungjae
    Lee, Soo-Young
    [J]. INTERSPEECH 2021, 2021, : 2337 - 2338