MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION FOR DNN-BASED TTS SYNTHESIS

Cited by: 0
Authors
Fan, Yuchen [1 ]
Qian, Yao [1 ]
Soong, Frank K. [1 ]
He, Lei [1 ]
Affiliations
[1] Microsoft Research Asia, Beijing, People's Republic of China
Keywords
statistical parametric speech synthesis; deep neural networks; multi-task learning; transfer learning; speech synthesis
DOI
Not available
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In DNN-based TTS synthesis, the hidden layers of a DNN can be viewed as a deep transformation of the linguistic features, and the output layer as a representation of the acoustic space that regresses the transformed linguistic features to acoustic parameters. The deep-layered architecture of a DNN can not only represent a highly complex transformation compactly, but also take advantage of a huge amount of training data. In this paper, we propose an approach to modeling multiple speakers for TTS with a single, general DNN, in which the hidden layers are shared among the speakers while the output layer is composed of speaker-dependent nodes that explain each speaker's acoustic targets. Experimental results show that our approach significantly improves the quality of the synthesized speech, both objectively and subjectively, compared with speech synthesized from individual, speaker-dependent DNN-based TTS systems. We further transfer the shared hidden layers to a new speaker with limited training data, and the resulting synthesized speech of the new speaker also achieves good quality in terms of naturalness and speaker similarity.
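The shared-hidden-layer, speaker-dependent-output architecture described in the abstract can be illustrated with a minimal sketch. The snippet below is a hypothetical illustration in PyTorch, not the authors' implementation: the MultiSpeakerDNN class, the layer sizes, the sigmoid activations, and the add_speaker adaptation routine are all assumptions, chosen only to show shared hidden layers feeding per-speaker regression heads and the transfer step of freezing the shared layers for a new speaker.

    # Minimal sketch (PyTorch, hypothetical) of a multi-speaker DNN for TTS:
    # shared hidden layers transform linguistic features; each speaker owns a
    # speaker-dependent output layer that regresses to acoustic parameters.
    import torch.nn as nn

    class MultiSpeakerDNN(nn.Module):
        def __init__(self, ling_dim, acoustic_dim, hidden_dim, speakers):
            super().__init__()
            self.hidden_dim = hidden_dim
            # Shared "deep transformation" of linguistic features.
            self.shared = nn.Sequential(
                nn.Linear(ling_dim, hidden_dim), nn.Sigmoid(),
                nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
                nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            )
            # One speaker-dependent output layer (regression head) per speaker.
            self.heads = nn.ModuleDict(
                {spk: nn.Linear(hidden_dim, acoustic_dim) for spk in speakers}
            )

        def forward(self, ling_feats, speaker):
            # Route the shared representation through the chosen speaker's head.
            return self.heads[speaker](self.shared(ling_feats))

        def add_speaker(self, speaker, acoustic_dim):
            # Speaker adaptation with limited data: reuse (and freeze) the shared
            # hidden layers, then train only the new speaker-dependent output layer.
            self.heads[speaker] = nn.Linear(self.hidden_dim, acoustic_dim)
            for p in self.shared.parameters():
                p.requires_grad = False

In multi-speaker training, each mini-batch would be routed to its own speaker's head while gradients from all speakers update the shared layers; for adaptation, only the new head's parameters would be handed to the optimizer.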
Pages: 4475-4479
Number of pages: 5
Related papers
50 items in total
  • [31] Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment
    Sivasankaran, Sunit
    Vincent, Emmanuel
    Fohr, Dominique
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2703 - 2707
  • [32] Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
    Valles-Perez, Ivan
    Roth, Julian
    Beringer, Grzegorz
    Barra-Chicote, Roberto
    Droppo, Jasha
    [J]. INTERSPEECH 2021, 2021, : 3131 - 3135
  • [33] Sparse DNN-based speaker segmentation using side information
    Ma, Yong
    Bao, Chang-Chun
    [J]. ELECTRONICS LETTERS, 2015, 51 (08) : 651 - 653
  • [34] Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms
    Zhao, Wei
    Xu, Li
    He, Ting
    [J]. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020, : 7498 - 7503
  • [35] Zero-shot multi-speaker accent TTS with limited accent data
    Zhang, Mingyang
    Zhou, Yi
    Wu, Zhizheng
    Li, Haizhou
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
  • [36] A hybrid approach to speaker recognition in multi-speaker environment
    Trivedi, J
    Maitra, A
    Mitra, SK
    [J]. PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
  • [37] CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?
    Ogun, Sewade
    Colotte, Vincent
    Vincent, Emmanuel
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 900 - 905
  • [38] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar
    Yang, Jichen
    Li, Haizhou
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635
  • [39] Automatic speaker clustering from multi-speaker utterances
    McLaughlin, J
    Reynolds, D
    Singer, E
    O'Leary, GC
    [J]. ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 817 - 820
  • [40] SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
    Soleymanpour, Mohammad
    Johnson, Michael T.
    Soleymanpour, Rahim
    Berry, Jeffrey
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7382 - 7386