Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Times Cited: 0
Authors
Dong, Zhongping [1]
Xu, Yan [1]
Abel, Andrew [2 ]
Wang, Dong [3 ]
Affiliations
[1] Xian Jiaotong Liverpool Univ, Sch Adv Technol, Suzhou 215123, Peoples R China
[2] Univ Strathclyde, Comp & Informat Sci, Glasgow G1 1XQ, Scotland
[3] Tsinghua Univ, BNRist, Ctr Speech & Language Technol (CSLT), Beijing 100084, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 02
Keywords
speech reconstruction; lipreading; Gabor features; lip features; speech synthesis; image processing; machine learning; FILTER
DOI
10.3390/app14020798
Chinese Library Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
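
To make the visual front end concrete, the following is a minimal Python/OpenCV sketch of the kind of Gabor mouth-feature extraction the abstract describes; the kernel size, number of orientations, wavelength, and mean/energy pooling are illustrative assumptions, not the paper's actual configuration.

import cv2
import numpy as np

def gabor_mouth_features(mouth_roi, orientations=4, wavelength=8.0):
    """Filter a grayscale mouth crop with a small Gabor bank and pool the
    responses into a low-dimensional feature vector (hypothetical setup)."""
    features = []
    for k in range(orientations):
        theta = k * np.pi / orientations  # filter orientation in radians
        kernel = cv2.getGaborKernel(
            ksize=(15, 15), sigma=4.0, theta=theta,
            lambd=wavelength, gamma=0.5, psi=0.0, ktype=cv2.CV_32F)
        filtered = cv2.filter2D(mouth_roi.astype(np.float32), cv2.CV_32F, kernel)
        # Summarise each filtered image by its mean and energy, giving a
        # compact descriptor instead of raw pixel maps.
        features.extend([filtered.mean(), (filtered ** 2).mean()])
    return np.array(features, dtype=np.float32)

# Example: a 64x64 grayscale mouth crop -> an 8-dimensional feature vector.
roi = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(gabor_mouth_features(roi).shape)  # (8,)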
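
The abstract's second stage maps per-frame visual features to encoded spectrogram frames with an LSTM. Below is a minimal PyTorch sketch of that mapping; the layer sizes, the 80-bin spectrogram target, and the feature dimension are assumptions for illustration, not the published architecture.

import torch
import torch.nn as nn

class Fea2Spec(nn.Module):
    def __init__(self, feat_dim=8, hidden=256, spec_bins=80):
        super().__init__()
        # The LSTM consumes one visual feature vector per video frame.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # A linear head predicts one encoded spectrogram frame per time step.
        self.head = nn.Linear(hidden, spec_bins)

    def forward(self, x):             # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out)         # (batch, time, spec_bins)

model = Fea2Spec()
frames = torch.randn(2, 75, 8)        # GRID sentences are ~3 s at 25 fps = 75 frames
spec = model(frames)                   # predicted spectrogram: (2, 75, 80)
print(spec.shape)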
Pages: 30