EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Cited by: 1
Authors
Miao, Chenfeng [1]
Zhu, Qingying [1]
Chen, Minchuan [1]
Ma, Jun [1]
Wang, Shaojun [1]
Xiao, Jing [1]
Affiliations
[1] Ping Technol, Shanghai 200120, Peoples R China
Keywords
Training; Vectors; Computational modeling; Task analysis; Acoustics; Couplings; Computer architecture; Text-to-speech; speech synthesis; voice conversion; differentiable aligner; VAE; hierarchical-VAE; end-to-end
DOI
10.1109/TASLP.2024.3369528
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and complex training procedures as most previous TTS works have. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.
Pages: 1650-1661
Number of pages: 12
Related papers
50 in total
  • [21] SEMI-SUPERVISED END-TO-END SPEECH RECOGNITION USING TEXT-TO-SPEECH AND AUTOENCODERS
    Karita, Shigeki
    Watanabe, Shinji
    Iwata, Tomoharu
    Delcroix, Marc
    Ogawa, Atsunori
    Nakatani, Tomohiro
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6166 - 6170
  • [22] Adaptive End-to-End Text-to-Speech Synthesis Based on Error Correction Feedback from Humans
    Fujii, Kazuki
    Saito, Yuki
    Saruwatari, Hiroshi
    Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022, 2022, : 1702 - 1707
  • [23] Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2
    Mandeel, Ali Raheem
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    INFOCOMMUNICATIONS JOURNAL, 2022, 14 (03): : 55 - 62
  • [24] Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework
    Prakash, Anusha
    Murthy, Hema A.
    INTERSPEECH 2020, 2020, : 2962 - 2966
  • [25] Optimization for Low-Resource Speaker Adaptation in End-to-End Text-to-Speech
    Hong, Changi
    Lee, Jung Hyuk
    Jeon, Moongu
    Kim, Hong Kook
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 1060 - 1061
  • [26] Myanmar Text-to-Speech System based on Tacotron (End-to-End Generative Model)
    Win, Yuzana
    Lwin, Htoo Pyae
    Masada, Tomonari
    11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020, : 572 - 577
  • [27] SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
    Cho, Hyunjae
    Jung, Wonbin
    Lee, Junhyeok
    Woo, Sang Hoon
    INTERSPEECH 2022, 2022, : 1 - 5
  • [28] End-to-End Speech Synthesis for Bangla with Text Normalization
    Pial, Tanzir Islam
    Aunti, Shahreen Salim
    Ahmed, Shabbir
    Heickal, Hasnain
    2018 5TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE/ INTELLIGENCE AND APPLIED INFORMATICS (CSII 2018), 2018, : 66 - 71
  • [29] End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders
    Masumura, Ryo
    Sato, Hiroshi
    Tanaka, Tomohiro
    Moriya, Takafumi
    Ijima, Yusuke
    Oba, Takanobu
    INTERSPEECH 2019, 2019, : 1606 - 1610
  • [30] A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning
    Oyucu, Saadin
    ELECTRONICS, 2023, 12 (08)