Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Cited by: 1
Authors
Park, Seongyeon [1]
Kim, Bohyung [1]
Oh, Tae-Hyun [2, 3, 4]
Affiliations
[1] CNAI, Seoul, South Korea
[2] Yonsei Univ, Inst Convergence Res & Educ Adv Technol, Seoul, South Korea
[3] POSTECH, Dept EE, Pohang, South Korea
[4] POSTECH, GSAI, Pohang, South Korea
Source
Keywords
Zero-shot; Voice Conversion; Text-to-speech; Speech Synthesis; Efficient Optimum Discovery
DOI
10.21437/Interspeech.2023-58
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Zero-shot TTS and VC methods have recently gained attention because they can generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while inheriting useful properties from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This is problematic because finding the optimal balance requires a burdensome search over loss-balance hyper-parameters. In this work, we propose a novel framework that finds this optimum without search by inducing the decoder of VITS-based models to reach its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings. We provide an explanation for the results in the discussion.
Pages: 4319-4323
Page count: 5