Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Authors
Guan, Wenhao [1]
Li, Tao [1]
Li, Yishuang [2]
Huang, Hukai [1]
Hong, Qingyang [1]
Li, Lin [2, 3]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Xiamen Univ, Sch Elect Sci & Engn, Xiamen, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
speech synthesis; style transfer; variational autoencoder; diffusion probabilistic model;
DOI
10.21437/Interspeech.2023-1151
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
With the growing demand for autonomous control and personalized speech generation, style control and transfer in text-to-speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to produce refined mel-spectrograms. Specifically, we design both a two-stage and a one-stage system to improve audio quality and style transfer performance. Second, we design a diffusion bridge for the quantized VAE to efficiently learn complex discrete style representations and improve style transfer performance. To further strengthen style transfer, we introduce ControlVAE, which improves reconstruction quality while retaining good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than baseline models.
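The abstract does not spell out how ControlVAE is applied in this system. As general background, ControlVAE-style training adjusts the weight on the KL term with a PI controller so that the observed KL divergence tracks a chosen set point. The Python sketch below illustrates that idea only; the class name `PIController`, the gain values, and the bounds on the weight are hypothetical and are not taken from the paper.

```python
import math


class PIController:
    """Hypothetical PI controller for a VAE's KL weight (beta).

    Illustrative only: gains and bounds are assumed values,
    not settings reported by the paper.
    """

    def __init__(self, kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0):
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0  # accumulated integral term

    def step(self, target_kl, observed_kl):
        """Return the KL weight to use for the next training step."""
        error = target_kl - observed_kl
        # Clamp the exponent so math.exp cannot overflow.
        bounded = max(-50.0, min(50.0, error))
        # Nonlinear proportional term: near kp when KL is above target,
        # near 0 when KL is below target.
        p_term = self.kp / (1.0 + math.exp(bounded))
        # Integral term grows while the observed KL exceeds the target.
        i_candidate = self.integral - self.ki * error
        beta = p_term + i_candidate + self.beta_min
        # Anti-windup: only keep the new integral while beta is unsaturated.
        if self.beta_min <= beta <= self.beta_max:
            self.integral = i_candidate
        return min(self.beta_max, max(self.beta_min, beta))


# Example: the observed KL stays above the set point, so beta rises.
controller = PIController()
betas = [controller.step(target_kl=10.0, observed_kl=50.0) for _ in range(3)]
```

In a training loop, the returned value would multiply the KL term of the VAE loss at each step, so the weight rises while the KL is above the set point and falls back toward `beta_min` once it drops below.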
Pages: 4304-4308
Number of pages: 5