Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Cited by: 2
Authors
Zhang, Mingyang [1 ]
Zhou, Xuehao [2 ]
Wu, Zhizheng [1 ]
Li, Haizhou [1 ,2 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Natl Univ Singapore, Singapore 117583, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Accent speech synthesis; limited data; multi accent modelling; text-to-speech;
DOI
10.1109/LSP.2023.3292740
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
This letter presents a framework for zero-shot multi-speaker, multi-accent neural text-to-speech synthesis. It employs an encoder-decoder architecture together with an accent classifier that controls pronunciation variation at the encoder output. The encoder and decoder are pre-trained on a large-scale multi-speaker corpus, and the attention-based decoder consumes the accent-informed encoder outputs to generate accented prosody. The framework can be fine-tuned with limited training data from multiple accents and is able to generate accented speech for unseen speakers. Both objective and subjective evaluations confirm the effectiveness of the proposed framework.
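The conditioning idea in the abstract, where an accent signal modifies the encoder outputs before an attention-based decoder consumes them, can be sketched in a few lines. This is a toy illustration only: all dimensions, the additive accent embedding, and the dot-product attention are simplifying assumptions, not details given by the letter.

```python
import math
import random

random.seed(0)

# Toy sizes (hypothetical; the letter does not specify dimensions).
T_ENC, T_DEC, D = 6, 4, 8   # encoder steps, decoder steps, hidden size
N_ACCENTS = 3

def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]

# Stand-in encoder outputs (in the paper these come from an encoder
# pre-trained on a large-scale multi-speaker corpus).
enc_out = [rand_vec(D) for _ in range(T_ENC)]

# Accent conditioning: add a per-accent embedding to every encoder frame,
# a simple stand-in for how the accent signal steers pronunciation
# variation at the encoder output.
accent_emb = [rand_vec(D) for _ in range(N_ACCENTS)]
accent_id = 1
informed = [[e + a for e, a in zip(frame, accent_emb[accent_id])]
            for frame in enc_out]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def attend(query, keys):
    """One attention step: softmax over key scores, weighted sum of keys."""
    scores = [dot(k, query) / math.sqrt(D) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(D)]
    return context, weights

# The attention-based decoder attends over the accent-informed encoder
# outputs at each step to produce context vectors for accented prosody.
queries = [rand_vec(D) for _ in range(T_DEC)]
contexts = [attend(q, informed)[0] for q in queries]
print(len(contexts), len(contexts[0]))  # 4 8
```

In the actual system the accent classifier is trained jointly to supervise this conditioning, and fine-tuning with limited multi-accent data adapts the pre-trained encoder-decoder; the sketch only shows the data flow.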
Pages: 947-951 (5 pages)
Related Papers
50 records in total
  • [31] Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation
    Min, Dongchan
    Lee, Dong Bok
    Yang, Eunho
    Hwang, Sung Ju
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [32] Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech
    Nakai, Yusuke
    Saito, Yuki
    Udagawa, Kenta
    Saruwatari, Hiroshi
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 743 - 748
  • [33] A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Minchan
    Lee, Hyeonseung
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 55 - 59
  • [34] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    INTERSPEECH 2020, 2020, : 3191 - 3195
  • [35] Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
    Zhou, Yixuan
    Song, Changhe
    Li, Xiang
    Zhang, Luwen
    Wu, Zhiyong
    Bian, Yanyao
    Su, Dan
    Meng, Helen
    INTERSPEECH 2022, 2022, : 2573 - 2577
  • [36] Multi-accent Speech Separation with One Shot Learning
    Huang, Kuan Po
    Wu, Yuan-Kuei
    Lee, Hung-yi
    1ST WORKSHOP ON META LEARNING AND ITS APPLICATIONS TO NATURAL LANGUAGE PROCESSING (METANLP 2021), 2021, : 59 - 66
  • [37] Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
    Luong, Hieu-Thi
    Wang, Xin
    Yamagishi, Junichi
    Nishizawa, Nobuyuki
    INTERSPEECH 2019, 2019, : 1303 - 1307
  • [38] VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Peng, Puyuan
    Huang, Po-Yao
    Li, Shang-Wen
    Mohamed, Abdelrahman
    Harwath, David
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 12442 - 12462
  • [39] Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image
    Goto, Shunsuke
    Onishi, Kotaro
    Saito, Yuki
    Tachibana, Kentaro
    Mori, Koichiro
    INTERSPEECH 2020, 2020, : 1321 - 1325
  • [40] Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios
    Xie, Qicong
    Li, Tao
    Wang, Xinsheng
    Wang, Zhichao
    Xie, Lei
    Yu, Guoqiao
    Wan, Guanglu
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 66 - 70