Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Cited by: 0

Authors:

Guan, Wenhao ^{[1]}

Li, Tao ^{[1]}

Li, Yishuang ^{[2]}

Huang, Hukai ^{[1]}

Hong, Qingyang ^{[1]}

Li, Lin ^{[2,3]}

Affiliations:

[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China

[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China

[3] Xiamen Univ, Sch Elect Sci & Engn, Xiamen, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

speech synthesis; style transfer; variational autoencoder; diffusion probabilistic model;

DOI:

10.21437/Interspeech.2023-1151

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

With the growing demand for autonomous control and personalized speech generation, style control and transfer in Text-to-Speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to obtain refined mel-spectrograms. Specifically, we design both a two-stage and a one-stage system to improve audio quality and style transfer performance. Second, we design a diffusion bridge for the quantized VAE to efficiently learn complex discrete style representations and further improve style transfer. To strengthen style transfer, we introduce ControlVAE, which improves reconstruction quality while retaining good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than the baseline models.
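The abstract names ControlVAE without describing its mechanism. As a rough illustration only, the sketch below shows the kind of proportional-integral (PI) controller that ControlVAE-style training generally uses to adjust the KL weight so that the KL divergence tracks a chosen set point. The class name, gains, set point, and clamping range are hypothetical values picked for illustration; none of them come from this paper, and the authors' implementation may differ.

```python
import math


class PIController:
    """Hypothetical PI controller for the KL weight (beta) in a VAE loss.

    All default values are illustrative assumptions, not settings
    reported in the paper.
    """

    def __init__(self, kl_target, kp=0.01, ki=0.0001,
                 beta_min=0.0, beta_max=1.0):
        self.kl_target = kl_target
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0
        self.beta = beta_min

    def step(self, kl_observed):
        """Return the KL weight to use for the next training step."""
        # Positive error: KL is below the set point, so beta should shrink.
        error = self.kl_target - kl_observed
        # Nonlinear proportional term; the exponent is capped to avoid overflow.
        p_term = self.kp / (1.0 + math.exp(min(error, 50.0)))
        # Candidate integral term accumulates the negated error.
        new_integral = self.integral - self.ki * error
        raw = p_term + new_integral + self.beta_min
        # Simple anti-windup: keep the integral only while beta is unsaturated.
        if self.beta_min <= raw <= self.beta_max:
            self.integral = new_integral
        self.beta = min(max(raw, self.beta_min), self.beta_max)
        return self.beta


if __name__ == "__main__":
    controller = PIController(kl_target=10.0)
    for kl in (100.0, 60.0, 20.0, 5.0):
        print(f"observed KL={kl:6.1f} -> beta={controller.step(kl):.4f}")
```

In a training loop, the returned weight would multiply the KL term of the VAE loss, so an observed KL above the set point raises the penalty and an observed KL below it relaxes the penalty.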

Pages: 4304-4308

Page count: 5
