Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Cited by: 0
Authors
Dong, Zhongping [1 ]
Xu, Yan [1 ]
Abel, Andrew [2 ]
Wang, Dong [3 ]
Affiliations
[1] Xian Jiaotong Liverpool Univ, Sch Adv Technol, Suzhou 215123, Peoples R China
[2] Univ Strathclyde, Comp & Informat Sci, Glasgow G1 1XQ, Scotland
[3] BNRist Tsinghua Univ, Ctr Speech & Language Technol CSLT, Beijing 100084, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 02
Keywords
speech reconstruction; lipreading; Gabor features; lip features; speech synthesis; image processing; machine learning; FILTER
DOI
10.3390/app14020798
CLC Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
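The Gabor filtering of mouth images described in the abstract can be illustrated with a minimal sketch. The following is not the paper's actual pipeline (the authors' filter-bank parameters, mouth ROI detection, and LSTM decoder are not reproduced here); it only shows, under assumed parameters, how a 2D Gabor kernel is built and applied to a grayscale image. The helper names `gabor_kernel` and `filter_image` are hypothetical.

```python
import math

def gabor_kernel(size, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter: a Gaussian envelope modulating
    a cosine carrier at orientation theta and wavelength lam."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lam + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_image(image, kernel):
    """Same-size 2D correlation with zero padding, applied to a
    grayscale image given as a list of lists of floats."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ki in range(k):
                for kj in range(k):
                    ii, jj = i + ki - half, j + kj - half
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += image[ii][jj] * kernel[ki][kj]
            out[i][j] = acc
    return out

# Example: a 9x9 kernel tuned to (assumed) sigma=2, theta=0, wavelength=4.
k = gabor_kernel(9, 2.0, 0.0, 4.0)
```

In a full system, a bank of such kernels at several orientations and scales would be applied to each mouth frame, and the responses pooled into the low-dimensional feature vectors fed to the LSTM.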
Pages: 30