Silent versus modal multi-speaker speech recognition from ultrasound and video

Cited by: 4
Authors
Ribeiro, Manuel Sam [1 ,2 ]
Eshky, Aciel [1 ,3 ]
Richmond, Korin [1 ]
Renals, Steve [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Amazon, Seattle, WA 98109 USA
[3] Rasa Technol, Dhaka, Bangladesh
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
silent speech interfaces; silent speech; ultrasound tongue imaging; video lip imaging; articulatory speech recognition; COMMUNICATION;
DOI
10.21437/Interspeech.2021-23
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject classification codes
100104; 100213;
Abstract
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
Pages: 641-645
Number of pages: 5
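The abstract describes estimating the size of the articulatory space by computing the convex hull of tongue splines extracted from ultrasound tongue images. Below is a minimal sketch of how such an estimate could be computed, assuming spline points have already been extracted per frame as (x, y) coordinates; the function name, the use of scipy.spatial.ConvexHull, and the synthetic stand-in data are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: convex-hull estimate of the articulatory space covered by
# tongue splines, assuming splines are extracted upstream by a tongue-contour
# tracker. Names and data below are hypothetical.
import numpy as np
from scipy.spatial import ConvexHull


def articulatory_space_area(splines):
    """splines: iterable of (num_points, 2) arrays, one spline per frame,
    holding (x, y) coordinates in the ultrasound image plane.
    Returns the area of the convex hull covering all pooled spline points."""
    points = np.concatenate([np.asarray(s, dtype=float) for s in splines], axis=0)
    hull = ConvexHull(points)
    # For 2-D input, ConvexHull.volume is the enclosed area
    # (ConvexHull.area would be the perimeter).
    return hull.volume


if __name__ == "__main__":
    # Hypothetical usage: compare a modal and a silent utterance of one speaker.
    rng = np.random.default_rng(0)
    modal_splines = [rng.normal(size=(40, 2)) for _ in range(100)]          # stand-in data
    silent_splines = [0.8 * rng.normal(size=(40, 2)) for _ in range(100)]   # smaller spread
    print("modal area :", articulatory_space_area(modal_splines))
    print("silent area:", articulatory_space_area(silent_splines))
```

A per-utterance (or per-speaker) hull area computed this way could then be compared across speaking modes, in line with the paper's finding that silent speech covers a smaller articulatory space than modal speech.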