FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Cited by: 5
Authors
Wang, Yongqi [1]
Zhao, Zhou [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
Funding
Zhejiang Provincial Natural Science Foundation; National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
lip-to-speech synthesis; multimodal translation; deep learning;
DOI
10.1145/3503161.3548194
CLC classification
TP39 [Computer Applications];
Subject classification codes
081203; 0835;
Abstract
Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces, with no restriction on head pose or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either with an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) instead of generating waveforms directly, they use a two-stage pipeline that first predicts mel-spectrograms and then reconstructs audio from the spectrograms, which complicates deployment and degrades speech quality through error propagation; 2) the audio reconstruction algorithm these models rely on limits both inference speed and audio quality, while neural vocoders are not applicable because the predicted spectrograms are not accurate enough; 3) the autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy, so neither is efficient in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech waveforms from unconstrained talking videos with low latency and a relatively small model size. In addition, departing from the widely used 3D-CNN visual frontend for lip movement encoding, we propose, for the first time for this task, a transformer-based visual frontend. Experiments show that our model achieves a 19.76x speedup in waveform generation over the current autoregressive model on 3-second input sequences, while obtaining superior audio quality.
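To make the design described in the abstract concrete, here is a minimal PyTorch sketch of a non-autoregressive end-to-end lip-to-speech model in this spirit: a transformer-based visual frontend over lip frames, followed by a parallel transposed-convolution decoder that upsamples visual features directly to the waveform, with no intermediate mel-spectrogram stage. All module names, dimensions, and rates (VisualFrontend, WaveformDecoder, 25 fps video, 16 kHz audio, 640x upsampling) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a non-autoregressive end-to-end lip-to-speech model.
# Hyperparameters and module structure are illustrative assumptions.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Transformer-based visual frontend: per-frame patch embedding followed
    by temporal self-attention (in place of the common 3D-CNN frontend)."""
    def __init__(self, dim=256, patch=16, layers=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))  # (B*T, dim, h, w)
        x = x.mean(dim=(2, 3)).view(b, t, -1)       # pool patches -> (B, T, dim)
        return self.temporal(x)                     # (B, T, dim)

class WaveformDecoder(nn.Module):
    """Non-autoregressive decoder: transposed convolutions upsample 25 fps
    visual features to the 16 kHz sample rate (8*8*10 = 640x) in parallel,
    emitting the waveform directly instead of a mel-spectrogram."""
    def __init__(self, dim=256, up_factors=(8, 8, 10)):
        super().__init__()
        blocks, ch = [], dim
        for r in up_factors:
            blocks += [nn.ConvTranspose1d(ch, ch // 2, 2 * r,
                                          stride=r, padding=r // 2),
                       nn.LeakyReLU(0.1)]
            ch //= 2
        blocks.append(nn.Conv1d(ch, 1, kernel_size=7, padding=3))
        self.net = nn.Sequential(*blocks, nn.Tanh())

    def forward(self, feats):                       # feats: (B, T, dim)
        return self.net(feats.transpose(1, 2)).squeeze(1)  # (B, T*640)

class LipToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        self.frontend = VisualFrontend()
        self.decoder = WaveformDecoder()

    def forward(self, frames):
        return self.decoder(self.frontend(frames))

model = LipToSpeech()
video = torch.randn(1, 75, 3, 96, 96)   # 3 s of lip crops at 25 fps
wav = model(video)                      # (1, 48000): 3 s of audio at 16 kHz
```

Because every output sample depends only on the visual features, the whole waveform is produced in one parallel pass; this is what removes the per-step latency of autoregressive decoding and the separate vocoder stage.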
Pages: 5678-5687
Number of pages: 10
Related papers
50 records in total
[1] Tian, Zhengkun; Yi, Jiangyan; Tao, Jianhua; Bai, Ye; Zhang, Shuai; Wen, Zhengqi. Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition. INTERSPEECH 2020, 2020: 5026-5030.
[2] Inaguma, Hirofumi; Higuchi, Yosuke; Duh, Kevin; Kawahara, Tatsuya; Watanabe, Shinji. ORTHROS: Non-Autoregressive End-to-End Speech Translation with Dual-Decoder. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 7503-7507.
[3] Gao, Zhifu; Zhang, Shiliang; McLoughlin, Ian; Yan, Zhijie. Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition. INTERSPEECH 2022, 2022: 2063-2067.
[4] Rybicka, Magdalena; Villalba, Jesus; Thebaud, Thomas; Dehak, Najim; Kowalczyk, Konrad. End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3960-3973.
[5] Omachi, Motoi; Fujita, Yuya; Watanabe, Shinji; Wang, Tianzi. Non-Autoregressive End-to-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6772-6776.
[6] Chuang, Shun-Po; Chuang, Yung-Sung; Chang, Chih-Chiang; Lee, Hung-yi. Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation. Findings of the Association for Computational Linguistics, ACL-IJCNLP 2021, 2021: 1068-1077.
[7] Li, Mohan; Doddipatla, Rama. Non-Autoregressive End-to-End Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding. 2022 IEEE Spoken Language Technology Workshop (SLT), 2022: 390-397.
[8] Fan, Ruchao; Chu, Wei; Chang, Peng; Alwan, Abeer. A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1436-1448.
[9] Gao, Changfeng; Cheng, Gaofeng; Zhou, Jun; Zhang, Pengyuan; Yan, Yonghong. Non-autoregressive Deliberation-Attention based End-to-End ASR. 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2021.
[10] Wang, Tianzi; Fujita, Yuya; Chang, Xuankai; Watanabe, Shinji. Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models. INTERSPEECH 2021, 2021: 3755-3759.