A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization

Cited by: 1

Authors:

Cheon, Sung Jun [1,2]

Choi, Byoung Jin [1,2]

Kim, Minchan [1,2]

Lee, Hyeonseung [1,2]

Kim, Nam Soo [1,2]

Affiliations:

[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea

[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea

Source:

IEEE SIGNAL PROCESSING LETTERS | 2022 / Volume 29

Keywords:

Training; Upper bound; Speech synthesis; Correlation; Mutual information; Synthesizers; Estimation; Disentanglement; mutual information; speech synthesis; style modeling; total correlation;

DOI:

10.1109/LSP.2021.3125259

Chinese Library Classification (CLC) codes:

TM [Electrical technology]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

In this letter, we propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing interactive dependency, which can be expressed as a sum of mutual information upper bound terms. Since the upper bound estimate converges from the early training stage, there is little performance degradation due to auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer with multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests validate that the proposed method can improve the synthesizer in terms of quality as well as controllability.
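
To illustrate the abstract's claim that the dependency among several latent factors can be written as a sum of mutual information terms, here is a minimal sketch for three factors. It assumes the dependency measure is the total correlation (which appears in the keywords). The symbols $z_1, z_2, z_3$ and the particular CLUB-style upper bound are assumptions chosen for illustration; this record does not state the authors' exact formulation.

```latex
% Total correlation of three latent variables (assumed notation z_1, z_2, z_3),
% decomposed by the chain rule into two mutual information terms.
\begin{align}
\mathrm{TC}(z_1, z_2, z_3)
  &= D_{\mathrm{KL}}\!\left( p(z_1, z_2, z_3) \,\|\, p(z_1)\, p(z_2)\, p(z_3) \right) \\
  &= I(z_1; z_2) + I(z_1, z_2; z_3).
\end{align}
% One known upper bound on each mutual information term
% (CLUB-style, stated here with the exact conditional p(y | x)):
\begin{equation}
I(x; y) \le \mathbb{E}_{p(x, y)}\!\left[ \log p(y \mid x) \right]
          - \mathbb{E}_{p(x)\, p(y)}\!\left[ \log p(y \mid x) \right].
\end{equation}
```

Under these assumptions, minimizing the sum of the upper bounds on the two terms pushes the total correlation toward zero, which is one sense in which three or more control factors can be said to be disentangled.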

Pages: 55-59

Number of pages: 5
