Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Cited by: 0
Authors
Guan, Wenhao [1 ]
Li, Tao [1 ]
Li, Yishuang [2 ]
Huang, Hukai [1 ]
Hong, Qingyang [1 ]
Li, Lin [2 ,3 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Xiamen Univ, Sch Elect Sci & Engn, Xiamen, Peoples R China
Source
INTERSPEECH 2023
Funding
National Natural Science Foundation of China
Keywords
speech synthesis; style transfer; variational autoencoder; diffusion probabilistic model;
DOI
10.21437/Interspeech.2023-1151
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
With the growing demand for autonomous control and personalized speech generation, style control and transfer in Text-to-Speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to produce refined mel-spectrograms; a two-stage and a one-stage variant are designed to improve audio quality and style-transfer performance, respectively. Second, a diffusion bridge over a quantized VAE is designed to efficiently learn complex discrete style representations and further improve style transfer. To strengthen style-transfer ability, we introduce ControlVAE, which improves reconstruction quality while retaining good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than baseline models.
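The interpretability claim rests on ControlVAE, which replaces the fixed KL weight of a beta-VAE with a feedback controller that drives the measured KL divergence toward a preset target, so the latent capacity can be set explicitly rather than tuned by hand. The following is a minimal sketch of that PI-control idea only; the class name, gains, and the simplified update rule are illustrative assumptions, not the authors' implementation or the exact formula from the ControlVAE paper.

# Sketch of a PI controller that adapts the KL weight (beta) so the
# observed KL tracks a chosen target, in the spirit of ControlVAE.
# Hypothetical names and simplified update rule.
class KLWeightController:
    def __init__(self, target_kl, kp=0.01, ki=0.001,
                 beta_min=0.0, beta_max=1.0):
        self.target_kl = target_kl   # desired KL per utterance/batch
        self.kp = kp                 # proportional gain
        self.ki = ki                 # integral gain
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.err_integral = 0.0      # accumulated control error

    def update(self, observed_kl):
        # error < 0: KL above target -> increase beta (more regularization)
        # error > 0: KL below target -> beta falls back toward its floor
        error = self.target_kl - observed_kl
        self.err_integral += error
        beta = self.beta_min - self.kp * error - self.ki * self.err_integral
        return min(max(beta, self.beta_min), self.beta_max)

# Usage inside a VAE training loop (sketch):
#   controller = KLWeightController(target_kl=3.0)
#   beta = controller.update(kl_loss.item())
#   loss = recon_loss + beta * kl_loss

Holding the KL near a target rather than annealing a fixed weight is what lets the style latent keep a predictable, interpretable amount of information while reconstruction quality is preserved.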
Pages: 4304-4308
Page count: 5