EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 1
Authors
Miao, Chenfeng [1]
Zhu, Qingying [1]
Chen, Minchuan [1]
Ma, Jun [1]
Wang, Shaojun [1]
Xiao, Jing [1]
Affiliation
[1] Ping Technol, Shanghai 200120, People's Republic of China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and complex training procedures as most previous TTS works have. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.
Pages: 1650-1661
Number of pages: 12