ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS

Cited: 0
Authors
Cooper, Erica [1 ]
Lai, Cheng-I [2]
Yasuda, Yusuke [1 ]
Fang, Fuming [1 ]
Wang, Xin [1 ]
Chen, Nanxin [3 ]
Yamagishi, Junichi [1 ]
Affiliations
[1] Natl Inst Informat, Tokyo, Japan
[2] MIT, Cambridge, MA 02139 USA
[3] Johns Hopkins Univ, Baltimore, MD USA
Keywords
Speech synthesis; speaker adaptation; speaker embeddings; transfer learning; speaker verification;
DOI
10.1109/icassp40776.2020.9054535
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Classification Codes
070206; 082403;
Abstract
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers. Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task; these embeddings also improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
Pages: 6184-6188
Number of pages: 5
Related Papers
50 records in total
  • [1] Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis
    Zhang, Mingyang
    Zhou, Xuehao
    Wu, Zhizheng
    Li, Haizhou
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 947 - 951
  • [2] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
    Bang, Chae-Woon
    Chun, Chanjun
    [J]. SENSORS, 2023, 23 (23)
  • [3] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
    Zhao, Botao
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
  • [4] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
    Casanova, Edresson
    Shulby, Christopher
    Golge, Eren
    Muller, Nicolas Michael
    de Oliveira, Frederico Santos
    Candido Junior, Arnaldo
    Soares, Anderson da Silva
    Aluisio, Sandra Maria
    Ponti, Moacir Antonelli
    [J]. INTERSPEECH 2021, 2021, : 3645 - 3649
  • [5] Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech
    Choi, Byoung Jin
    Jeong, Myeonghun
    Kim, Minchan
    Mun, Sung Hwan
    Kim, Nam Soo
    [J]. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1708 - 1712
  • [6] SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
    Yoon, Hyungchan
    Kim, Changhwan
    Um, Seyun
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 593 - 597
  • [7] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [8] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
    Jeong, Myeonghun
    Kim, Minchan
    Choi, Byoung Jin
    Yoon, Jaesam
    Jang, Won
    Kim, Nam Soo
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
  • [9] SNAC: Speaker-Normalized Affine Coupling Layer in Flow-Based Architecture for Zero-Shot Multi-Speaker Text-to-Speech
    Choi, Byoung Jin
    Jeong, Myeonghun
    Lee, Joun Yeop
    Kim, Nam Soo
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2502 - 2506
  • [10] Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
    Kumar, Neeraj
    Goel, Srishti
    Narang, Ankur
    Lall, Brejesh
    [J]. INTERSPEECH 2021, 2021, : 1354 - 1358