Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis

被引：2

作者：

Zhang, Mingyang ^{[1
]}

Zhou, Xuehao ^{[2
]}

Wu, Zhizheng ^{[1
]}

Li, Haizhou ^{[1
,2
]}

机构：

[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China

[2] Natl Univ Singapore, Singapore 117583, Singapore

来源：

IEEE SIGNAL PROCESSING LETTERS | 2023年 / 30卷

基金：

中国国家自然科学基金;

关键词：

Accent speech synthesis; limited data; multi accent modelling; text-to-speech;

D O I：

10.1109/LSP.2023.3292740

中图分类号：

TM [电工技术]; TN [电子技术、通信技术];

学科分类号：

0808 ; 0809 ;

摘要：

This letter presents a framework towards multi-accent neural text-to-speech synthesis for zero-shot multi-speaker, which employs an encoder-decoder architecture and an accent classifier to control the pronunciation variation from the encoder. The encoder and decoder are pre-trained on a large-scale multi-speaker corpus. The accent-informed encoder outputs are taken by the attention-based decoder to generate accented prosody. This framework allows for fine-tuning with limited training data from multiple accents, and is able to generate accented speech for unseen speakers. Both objective and subjective evaluations confirm the effectiveness of the proposed framework.

引用

页码：947 / 951

页数：5

共 50 条

[1] Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Yoon, Hyungchan
Kim, Changhwan
Song, Eunwoo
Yoon, Hyun-Wook
Kang, Hong-Goo
INTERSPEECH 2023, 2023, : 4299 - 4303
[2] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
Kumar, Neeraj
Narang, Ankur
Lall, Brejesh
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
[3] ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS
Cooper, Erica
Lai, Cheng-, I
Yasuda, Yusuke
Fang, Fuming
Wang, Xin
Chen, Nanxin
Yamagishi, Junichi
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6184 - 6188
[4] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Casanova, Edresson
Shulby, Christopher
Golge, Eren
Muller, Nicolas Michael
de Oliveira, Frederico Santos
Candido Junior, Arnaldo
Soares, Anderson da Silva
Aluisio, Sandra Maria
Ponti, Moacir Antonelli
INTERSPEECH 2021, 2021, : 3645 - 3649
[5] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
Bang, Chae-Woon
Chun, Chanjun
SENSORS, 2023, 23 (23)
[6] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
Zhao, Botao
Zhang, Xulong
Wang, Jianzong
Cheng, Ning
Xiao, Jing
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
[7] Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
Kumar, Neeraj
Goel, Srishti
Narang, Ankur
Lall, Brejesh
INTERSPEECH 2021, 2021, : 1354 - 1358
[8] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
Jeong, Myeonghun
Kim, Minchan
Choi, Byoung Jin
Yoon, Jaesam
Jang, Won
Kim, Nam Soo
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
[9] Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech
Choi, Byoung Jin
Jeong, Myeonghun
Kim, Minchan
Mun, Sung Hwan
Kim, Nam Soo
PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1708 - 1712
[10] Zero-shot multi-speaker accent TTS with limited accent data
Zhang, Mingyang
Zhou, Yi
Wu, Zhizheng
Li, Haizhou
2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936

← 1 2 3 4 5 →