MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION FOR DNN-BASED TTS SYNTHESIS

Cited by: 0

Authors:

Fan, Yuchen [1]

Qian, Yao [1]

Soong, Frank K. [1]

He, Lei [1]

Affiliations:

[1] Microsoft Research Asia, Beijing, People's Republic of China

Source:

2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2015

Keywords:

statistical parametric speech synthesis; deep neural networks; multi-task learning; transfer learning; speech synthesis

DOI:

Not available

Chinese Library Classification:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

In DNN-based TTS synthesis, the hidden layers of a DNN can be viewed as a deep transformation of the linguistic features, and the output layer as a representation of the acoustic space that regresses the transformed linguistic features to acoustic parameters. The deep-layered architecture of a DNN can not only represent highly complex transformations compactly, but also take advantage of large amounts of training data. In this paper, we propose an approach to modeling multi-speaker TTS with a general DNN, in which the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes that model the targets of each speaker. The experimental results show that our approach significantly improves the quality of synthesized speech, both objectively and subjectively, compared with speech synthesized by individual, speaker-dependent DNN-based TTS systems. We further transfer the hidden layers to a new speaker with limited training data, and the resulting synthesized speech of the new speaker also achieves good quality in terms of naturalness and speaker similarity.
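The architecture described in the abstract (hidden layers shared across speakers, one speaker-dependent output layer per speaker, and reuse of the hidden layers for a new speaker) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the toy layer sizes, the `tanh` activation, the speaker names, and the choice to keep the shared layers frozen while fitting the new speaker's output layer by plain gradient descent. The abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration (not from the paper).
LINGUISTIC_DIM, HIDDEN_DIM, ACOUSTIC_DIM = 8, 16, 4


def init_layer(n_in, n_out):
    """Create one affine layer with small random weights."""
    return {"W": rng.normal(0.0, 0.1, (n_in, n_out)), "b": np.zeros(n_out)}


# Hidden layers shared by all speakers.
shared = [init_layer(LINGUISTIC_DIM, HIDDEN_DIM),
          init_layer(HIDDEN_DIM, HIDDEN_DIM)]

# One linear output layer ("head") per speaker; names are hypothetical.
heads = {"spk_a": init_layer(HIDDEN_DIM, ACOUSTIC_DIM),
         "spk_b": init_layer(HIDDEN_DIM, ACOUSTIC_DIM)}


def hidden(x):
    """Map linguistic features through the shared hidden layers."""
    h = x
    for layer in shared:
        h = np.tanh(h @ layer["W"] + layer["b"])
    return h


def predict(x, speaker):
    """Regress acoustic parameters using the given speaker's output layer."""
    head = heads[speaker]
    return hidden(x) @ head["W"] + head["b"]


def adapt_new_speaker(name, x, y, lr=0.05, steps=200):
    """Fit a new output layer for `name`; the shared layers are not updated.

    Returns the mean squared error recorded at each step.
    """
    head = init_layer(HIDDEN_DIM, ACOUSTIC_DIM)
    h = hidden(x)  # fixed, because the shared layers stay frozen here
    losses = []
    for _ in range(steps):
        err = h @ head["W"] + head["b"] - y
        losses.append(float(np.mean(err ** 2)))
        head["W"] -= lr * h.T @ err / len(x)
        head["b"] -= lr * err.mean(axis=0)
    heads[name] = head
    return losses


# Synthetic stand-in for a small adaptation set from a new speaker.
x_new = rng.normal(size=(32, LINGUISTIC_DIM))
y_new = rng.normal(size=(32, ACOUSTIC_DIM))
losses = adapt_new_speaker("spk_new", x_new, y_new)
```

In this sketch, training on multiple speakers would update `shared` from every speaker's data while each entry in `heads` sees only its own speaker's targets. Adaptation then needs to estimate only the comparatively small output layer, which is one plausible reason limited data can suffice.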

Pages: 4475-4479

Page count: 5
