VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Cited by: 2
Authors
Lu, Hui [1 ,2 ]
Wu, Zhiyong [1 ,3 ]
Wu, Xixin [4 ]
Li, Xu [1 ]
Kang, Shiyin [5 ]
Liu, Xunying [1 ]
Meng, Helen [1 ,2 ,3 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[2] CUHK, Ctr Perceptual & Interact Intelligence, Hong Kong, Peoples R China
[3] Tsinghua Univ, Shenzhen Int Grad Sch, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen, Peoples R China
[4] Univ Cambridge, Dept Engn, Cambridge, England
[5] Huya Inc, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Non-autoregressive TTS; VAE; Glow; Transformer;
DOI
10.21437/Interspeech.2021-2121
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. Autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process is time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient owing to their parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to produce a hard alignment between the text and the spectrogram. Obtaining duration labels, whether through forced alignment or knowledge distillation, is cumbersome, and hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed VAENAR-TTS model is an end-to-end approach that does not require phoneme-level durations. It contains no recurrent structures and is fully non-autoregressive in both training and inference. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while its synthesis speed is comparable with other NAR-TTS models.
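To make the training setup the abstract describes more concrete, here is a minimal NumPy sketch of a VAE-style objective of this kind: a reconstruction loss over the spectrogram plus a KL term on the latent variable, with attention-based soft alignment between latent frames and text states instead of duration-based hard alignment. All shapes, the weight matrices (`W_post`, `W_attn`, `W_out`), the single-head attention decoder, and the standard-normal prior are illustrative assumptions for this toy sketch, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over all elements."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo_losses(text_h, spec, W_post, W_attn, W_out, rng):
    """Toy VAE loss terms: spectrogram reconstruction + KL, with soft
    attention between the latent variable and the text states.

    text_h : (T_text, d)     text-encoder outputs
    spec   : (T_spec, n_mel) target mel-spectrogram
    """
    d = text_h.shape[1]

    # Posterior q(z | text, spec): toy per-frame Gaussian statistics.
    mu_q = spec @ W_post                  # (T_spec, d)
    logvar_q = np.full_like(mu_q, -1.0)
    # Prior p(z): standard normal here for simplicity (an assumption;
    # a real model would condition the prior on the text).
    mu_p = np.zeros_like(mu_q)
    logvar_p = np.zeros_like(mu_q)

    # Reparameterized sample of the alignment-carrying latent variable.
    z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(mu_q.shape)

    # Decoder: soft attention from latent frames to text states (no
    # phoneme-level durations needed), then project back to mel bins.
    scores = (z @ W_attn) @ text_h.T / np.sqrt(d)  # (T_spec, T_text)
    align = softmax(scores, axis=-1)               # soft alignment
    recon = (align @ text_h) @ W_out               # (T_spec, n_mel)

    recon_loss = np.mean((recon - spec) ** 2)
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return recon_loss, kl, align
```

In training, both terms would be minimized jointly; each row of `align` is a distribution over text positions, which is what replaces the hard, duration-expanded alignment of other NAR-TTS models.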
Pages: 3775-3779 (5 pages)