Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation

Cited by: 4
Authors
Mitsui, Kentaro [1 ]
Koriyama, Tomoki [1 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1138656, Japan
Keywords
Text-to-speech synthesis; Multi-speaker modeling; Speaker representation; Gaussian process; Deep generative models;
DOI
10.1016/j.specom.2021.07.001
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker speech synthesis and speaker representation learning. A DGP is a deep architecture of Bayesian kernel regression, and DGP-based single-speaker speech synthesis has been reported to outperform deep neural network (DNN)-based synthesis within the statistical parametric speech synthesis framework. Extending this method to multiple speakers is expected to yield higher speech quality from a smaller number of training utterances per speaker. To apply DGPs to multi-speaker speech synthesis, we propose two methods: one using a DGP with one-hot speaker codes, and the other using a deep Gaussian process latent variable model (DGPLVM). The DGP with one-hot speaker codes uses additional GP layers to transform the codes into latent speaker representations. The DGPLVM directly models the distribution of latent speaker representations and learns it jointly with the acoustic model parameters. In this method, acoustic speaker similarity is expressed as similarity between speaker representations, so the voices of similar speakers are modeled efficiently. We experimentally compared the proposed methods with conventional DNN-based and variational autoencoder (VAE)-based frameworks in terms of acoustic feature distortion and subjective speech quality. The results demonstrate that (1) the proposed DGP-based and DGPLVM-based methods improve subjective speech quality compared with a feed-forward DNN-based method; (2) even when the amount of training data for target speakers is limited, the DGPLVM-based method outperforms the other methods, including the VAE-based one; and (3) by using a speaker representation randomly sampled from the learned speaker space, the DGPLVM-based method can generate voices of non-existent speakers.
Pages: 132-145
Page count: 14
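
To make the DGPLVM idea in the abstract concrete, below is a minimal Python/PyTorch sketch, not the authors' implementation: the paper trains its GP layers with variational Bayesian inference, whereas this sketch collapses each GP layer into a random Fourier feature approximation of an RBF kernel with a point-estimate linear readout, and trains with a plain MSE loss. All names (RFFLayer, DGPLVMSketch), dimensions, layer counts, and the toy data are illustrative assumptions; the one piece taken directly from the abstract is that each speaker's latent representation is learned jointly with the acoustic model.

    # Illustrative sketch only: GP layers approximated by random Fourier
    # features (RFF); the actual paper uses Bayesian inference over layers.
    import math
    import torch
    import torch.nn as nn

    class RFFLayer(nn.Module):
        """One GP-like layer: fixed random features approximating an RBF
        kernel, followed by a trainable linear readout (the Bayesian
        regression of a real DGP is reduced to a point estimate here)."""
        def __init__(self, d_in, d_out, n_features=256):
            super().__init__()
            # Random frequencies and phases define the kernel approximation.
            self.register_buffer("W", torch.randn(d_in, n_features))
            self.register_buffer("b", 2 * math.pi * torch.rand(n_features))
            self.readout = nn.Linear(n_features, d_out)

        def forward(self, x):
            phi = math.sqrt(2.0 / self.W.shape[1]) * torch.cos(x @ self.W + self.b)
            return self.readout(phi)

    class DGPLVMSketch(nn.Module):
        def __init__(self, n_speakers, d_ling, d_spk, d_acoustic):
            super().__init__()
            # Latent speaker representations, optimized jointly with the
            # acoustic model (the DGPLVM also places a prior over these;
            # plain embeddings stand in for that here).
            self.speaker_latents = nn.Embedding(n_speakers, d_spk)
            self.layers = nn.Sequential(
                RFFLayer(d_ling + d_spk, 128),
                RFFLayer(128, 128),
                RFFLayer(128, d_acoustic),
            )

        def forward(self, ling, speaker_ids):
            z = self.speaker_latents(speaker_ids)            # (batch, d_spk)
            return self.layers(torch.cat([ling, z], dim=-1))

    # Toy usage: one joint optimization step over acoustic model
    # parameters and speaker latents, on random stand-in data.
    model = DGPLVMSketch(n_speakers=10, d_ling=300, d_spk=16, d_acoustic=60)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(32, 300)              # frame-level linguistic features
    spk = torch.randint(0, 10, (32,))     # speaker index per frame
    y = torch.randn(32, 60)               # target acoustic features
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x, spk), y)
    loss.backward()
    opt.step()

Feeding the layer stack a freshly sampled vector (e.g. z = torch.randn(1, 16)) instead of a learned embedding mirrors point (3) of the abstract: a random draw from the speaker space yields the voice of a non-existent speaker.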