Silent versus modal multi-speaker speech recognition from ultrasound and video

Cited by: 4
Authors
Ribeiro, Manuel Sam [1 ,2 ]
Eshky, Aciel [1 ,3 ]
Richmond, Korin [1 ]
Renals, Steve [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Amazon, Seattle, WA 98109 USA
[3] Rasa Technol, Dhaka, Bangladesh
Source
INTERSPEECH 2021
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
silent speech interfaces; silent speech; ultrasound tongue imaging; video lip imaging; articulatory speech recognition; communication
DOI
10.21437/Interspeech.2021-23
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
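To make the articulatory-space measure concrete, here is a minimal sketch (not the authors' implementation) of the convex-hull computation described in the abstract, assuming each tongue spline is an (N, 2) array of (x, y) contour points extracted from one ultrasound frame. The function name articulatory_space_area and the synthetic data are illustrative assumptions; scipy.spatial.ConvexHull reports the enclosed area of a 2-D hull via its volume attribute.

```python
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(splines):
    """Area of the 2-D convex hull covering all tongue-spline points."""
    points = np.vstack(list(splines))  # pool (x, y) points across frames
    hull = ConvexHull(points)          # Qhull-based 2-D convex hull
    return hull.volume                 # for 2-D input, .volume is the area
                                       # (.area would be the perimeter)

# Hypothetical stand-in data: 100 frames of 40-point contours per mode,
# with the silent-speech point cloud drawn slightly tighter than modal.
rng = np.random.default_rng(0)
modal_splines = [rng.normal(0.0, 1.0, size=(40, 2)) for _ in range(100)]
silent_splines = [rng.normal(0.0, 0.8, size=(40, 2)) for _ in range(100)]

ratio = (articulatory_space_area(silent_splines)
         / articulatory_space_area(modal_splines))
print(f"silent/modal articulatory-space ratio: {ratio:.2f}")  # < 1 here by construction
```

Pooling points across all frames of a speaking mode before taking the hull yields one area per mode, which can then be compared across modes as the abstract describes.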
Pages: 641-645
Number of pages: 5
Related papers
50 records in total
  • [31] Research on ASIC for multi-speaker isolated word recognition
    Xiong, B
    Sun, YH
    [J]. 1996 2ND INTERNATIONAL CONFERENCE ON ASIC, PROCEEDINGS, 1996, : 135 - 137
  • [32] Automatic speaker clustering from multi-speaker utterances
    McLaughlin, J
    Reynolds, D
    Singer, E
    O'Leary, GC
    [J]. ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 817 - 820
  • [33] INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS
    Raj, Desh
    Denisov, Pavel
    Chen, Zhuo
    Erdogan, Hakan
    Huang, Zili
    He, Maokui
    Watanabe, Shinji
    Du, Jun
    Yoshioka, Takuya
    Luo, Yi
    Kanda, Naoyuki
    Li, Jinyu
    Wisdom, Scott
    Hershey, John R.
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 897 - 904
  • [34] Single-speaker/multi-speaker co-channel speech classification
    Rossignol, Stephane
    Pietquin, Olivier
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2322 - 2325
  • [35] Unsupervised Discovery of Phoneme Boundaries in Multi-Speaker Continuous Speech
    Armstrong, Tom
    Antetomaso, Stephanie
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING (ICDL), 2011
  • [36] LCMV BEAMFORMING WITH SUBSPACE PROJECTION FOR MULTI-SPEAKER SPEECH ENHANCEMENT
    Hassani, Amin
    Bertrand, Alexander
    Moonen, Marc
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 91 - 95
  • [37] Video Based Silent Speech Recognition
    Sangeetha, V
    Justin, Judith
    Mahalakshmi, A.
    [J]. PROCEEDING OF THE INTERNATIONAL CONFERENCE ON COMPUTER NETWORKS, BIG DATA AND IOT (ICCBI-2018), 2020, 31 : 269 - 277
  • [38] Neural Speech Tracking Highlights the Importance of Visual Speech in Multi-speaker Situations
    Haider, Chandra L.
    Park, Hyojin
    Hauswald, Anne
    Weisz, Nathan
    [J]. JOURNAL OF COGNITIVE NEUROSCIENCE, 2024, 36 (01) : 128 - 142
  • [39] Integration of audio-visual information for multi-speaker multimedia speaker recognition
    Yang, Jichen
    Chen, Fangfan
    Cheng, Yu
    Lin, Pei
    [J]. DIGITAL SIGNAL PROCESSING, 2024, 145
  • [40] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934