Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Cited by: 0

Authors:

Dong, Zhongping [1]

Xu, Yan [1]

Abel, Andrew [2]

Wang, Dong [3]

Affiliations:

[1] Xian Jiaotong Liverpool Univ, Sch Adv Technol, Suzhou 215123, Peoples R China

[2] Univ Strathclyde, Comp & Informat Sci, Glasgow G1 1XQ, Scotland

[3] BNRist Tsinghua Univ, Ctr Speech & Language Technol CSLT, Beijing 100084, Peoples R China

Source:

APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 02

Keywords:

speech reconstruction; lipreading; gabor features; lip features; speech synthesis; image processing; machine learning; FILTER;

DOI:

10.3390/app14020798

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703;

Abstract:

In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
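
For readers unfamiliar with the Gabor filtering mentioned in the abstract, the following is a minimal, generic sketch of a 2D Gabor kernel (a Gaussian envelope multiplied by a cosine carrier) and its application to an image patch. It is not the authors' implementation: the function names `gabor_kernel` and `filter_response`, and every parameter value (kernel size, wavelength, orientation, sigma, aspect ratio), are illustrative assumptions rather than settings taken from the paper.

```python
import math


def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor kernel as a size x size list of lists.

    `size` is assumed to be odd. All default values are illustrative
    assumptions, not parameters reported in the paper.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinate frame by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope; gamma sets the spatial aspect ratio.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Cosine carrier along the rotated x axis.
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(image, kernel):
    """Valid-mode 2D correlation of `image` (list of lists) with `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out
```

In a lipreading setting, a kernel like this would typically be applied to a cropped grayscale mouth region, with the filtered image or summary statistics derived from it used as visual features; how the paper derives its low-dimensional features is not specified in this record.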


Pages: 30
