Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder

Cited by: 8
Authors:
Csapo, Tamas Gabor [1 ,2 ]
Al-Radhi, Mohammed Salah [1 ]
Nemeth, Geza [1 ]
Gosztolya, Gabor [3 ,4 ]
Grosz, Tamas [4 ]
Toth, Laszlo [4 ]
Marko, Alexandra [2 ,5 ]
Affiliations:
[1] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, Budapest, Hungary
[2] MTA ELTE Lendulet Lingual Articulat Res Grp, Budapest, Hungary
[3] MTA SZTE Res Grp Artificial Intelligence, Szeged, Hungary
[4] Univ Szeged, Inst Informat, Szeged, Hungary
[5] Eotvos Lorand Univ, Dept Phonet, Budapest, Hungary
Keywords:
Silent speech interface; articulatory-to-acoustic mapping; F0 prediction; CNN; HMM
DOI:
10.21437/Interspeech.2019-2046
Chinese Library Classification:
R36 (Pathology); R76 (Otorhinolaryngology)
Subject Classification Codes:
100104; 100213
Abstract:
Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced / unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
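The abstract describes a convolutional neural network that maps each ultrasound tongue image to a frame of continuous vocoder parameters (ContF0, Maximum Voiced Frequency, and Mel-Generalized Cepstrum). As a rough illustration of that mapping setup (not the authors' code; the input resolution, layer sizes, and a 26-dimensional target of 1 ContF0 + 1 MVF + 24 MGC coefficients are all assumptions for the sketch):

```python
# Hedged sketch of a UTI -> continuous-vocoder-parameter regression CNN.
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class UTIToVocoderCNN(nn.Module):
    def __init__(self, n_targets: int = 26):  # 1 ContF0 + 1 MVF + 24 MGC (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # single-channel grayscale UTI frame
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 64x128 -> 32x64
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x64 -> 16x32
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 32, 128),
            nn.ReLU(),
            nn.Linear(128, n_targets),  # linear output: per-frame regression targets
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))

model = UTIToVocoderCNN()
frames = torch.randn(8, 1, 64, 128)  # batch of 8 hypothetical UTI frames
params = model(frames)               # one vocoder-parameter vector per frame
```

Training such a model with a mean-squared-error loss against vocoder parameters extracted from parallel audio would reflect the general articulatory-to-acoustic regression setup; the paper's specific architecture and preprocessing may differ.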
Pages: 894-898 (5 pages)
Related Papers (50 total):
  • [1] Eigentongue feature extraction for an ultrasound-based silent speech interface
    Hueber, T.
    Aversano, G.
    Chollet, G.
    Denby, B.
    Dreyfus, G.
    Oussar, Y.
    Roussel, P.
    Stone, M.
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PTS 1-3, PROCEEDINGS, 2007, : 1245 - +
  • [2] Silent vs Vocalized Articulation for a Portable Ultrasound-Based Silent Speech Interface
    Florescu, Victoria-M
    Crevier-Buchman, Lise
    Denby, Bruce
    Hueber, Thomas
    Colazo-Simon, Antonia
    Pillot-Loiseau, Claire
    Roussel, Pierre
    Gendrot, Cedric
    Quattrocchi, Sophie
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 450 - +
  • [3] Ultrasound-Based Silent Speech Interface Using Convolutional and Recurrent Neural Networks
    Moliner Juanpere, Eloi
    Csapo, Tamas Gabor
    [J]. ACTA ACUSTICA UNITED WITH ACUSTICA, 2019, 105 (04) : 587 - 590
  • [4] Ultrasound-Based Silent Speech Interface using Sequential Convolutional Auto-encoder
    Xu, Kele
    Wu, Yuxiang
    Gao, Zhifeng
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2194 - 2195
  • [5] Statistical Mapping between Articulatory and Acoustic Data for an Ultrasound-based Silent Speech Interface
    Hueber, Thomas
    Benaroya, Elie-Laurent
    Denby, Bruce
    Chollet, Gerard
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 600 - +
  • [6] Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
    Shandiz, Amin Honarmandi
    Toth, Laszlo
    Gosztolya, Gabor
    Marko, Alexandra
    Csapo, Tamas Gabor
    [J]. INTERSPEECH 2021, 2021, : 1932 - 1936
  • [7] Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces
    Toth, Laszlo
    Gosztolya, Gabor
    Grosz, Tamas
    Marko, Alexandra
    Csapo, Tamas Gabor
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 3172 - 3176
  • [8] DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface
    Csapo, Tamas Gabor
    Grosz, Tamas
    Gosztolya, Gabor
    Toth, Laszlo
    Marko, Alexandra
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3672 - 3676
  • [9] Vocoder-Based Speech Synthesis from Silent Videos
    Michelsanti, Daniel
    Slizovskaia, Olga
    Haro, Gloria
    Gomez, Emilia
    Tan, Zheng-Hua
    Jensen, Jesper
    [J]. INTERSPEECH 2020, 2020, : 3530 - 3534
  • [10] Visuo-Phonetic Decoding using Multi-Stream and Context-Dependent Models for an Ultrasound-based Silent Speech Interface
    Hueber, Thomas
    Benaroya, Elie-Laurent
    Chollet, Gerard
    Denby, Bruce
    Dreyfus, Gerard
    Stone, Maureen
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 628 - +