Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Cited: 0
Authors
Dong, Zhongping [1 ]
Xu, Yan [1 ]
Abel, Andrew [2 ]
Wang, Dong [3 ]
Affiliations
[1] Xian Jiaotong Liverpool Univ, Sch Adv Technol, Suzhou 215123, Peoples R China
[2] Univ Strathclyde, Comp & Informat Sci, Glasgow G1 1XQ, Scotland
[3] BNRist Tsinghua Univ, Ctr Speech & Language Technol CSLT, Beijing 100084, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 02
Keywords
speech reconstruction; lipreading; Gabor features; lip features; speech synthesis; image processing; machine learning; FILTER;
DOI
10.3390/app14020798
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
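The abstract's central idea is filtering the mouth region with oriented Gabor kernels and summarising the responses into a low-dimensional visual feature vector. The sketch below illustrates that general technique only; it is not the authors' implementation, and the specific kernel parameters, the four-orientation bank, and the mean-absolute-energy summary are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, gamma=0.5, psi=0.0):
    """Real-valued Gabor kernel: a cosine carrier under a Gaussian envelope,
    rotated to orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

def gabor_features(image, orientations=4):
    """Filter the image with a small bank of oriented Gabor kernels and
    reduce each filtered image to its mean absolute response, giving a
    low-dimensional feature vector (one value per orientation)."""
    feats = []
    for k in range(orientations):
        theta = k * np.pi / orientations
        response = convolve2d(image, gabor_kernel(theta=theta), mode="same")
        feats.append(np.mean(np.abs(response)))
    return np.array(feats)

# Toy stand-in for a cropped mouth-region image (in practice this would be
# a greyscale mouth crop from a video frame): a horizontal stripe pattern.
mouth = np.zeros((48, 48))
mouth[::6, :] = 1.0
print(gabor_features(mouth))  # 4 energy values, one per orientation
```

In a full pipeline such a per-frame feature vector would be computed for every video frame and the resulting sequence fed to a recurrent (e.g. LSTM) model that regresses the target spectrogram.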
Pages: 30
Related Papers
50 records in total
  • [21] Single-speaker/multi-speaker co-channel speech classification
    Rossignol, Stephane
    Pietquin, Olivier
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2322 - 2325
  • [22] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [23] Sparse Component Analysis for Speech Recognition in Multi-Speaker Environment
    Asaei, Afsaneh
    Bourlard, Herve
    Garner, Philip N.
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 1704 - 1707
  • [24] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
    Fujita, Kenichi
    Ando, Atsushi
    Ijima, Yusuke
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 93 - 104
  • [25] SPEAKER CONDITIONING OF ACOUSTIC MODELS USING AFFINE TRANSFORMATION FOR MULTI-SPEAKER SPEECH RECOGNITION
    Yousefi, Midia
    Hansen, John H. L.
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 283 - 288
  • [26] A unified network for multi-speaker speech recognition with multi-channel recordings
    Liu, Conggui
    Inoue, Nakamasa
    Shinoda, Koichi
    [J]. 2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1304 - 1307
  • [27] A Multi-channel/Multi-speaker Articulatory Database in Mandarin for Speech Visualization
    Zhang, Dan
    Liu, Xianqian
    Yan, Nan
    Wang, Lan
    Zhu, Yun
    Chen, Hui
    [J]. 2014 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2014, : 299 - +
  • [28] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
  • [29] Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features
    Zhang, Xuejie
    Xu, Yan
    Abel, Andrew K.
    Smith, Leslie S.
    Watt, Roger
    Hussain, Amir
    Gao, Chengxiang
    [J]. ENTROPY, 2020, 22 (12) : 1 - 24
  • [30] CLeLfPC: a Large Open Multi-Speaker Corpus of French Cued Speech
    Bigi, Brigitte
    Zimmermann, Maryvonne
    Andre, Carine
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 987 - 994