Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0
Authors
Skerry-Ryan, R. J. [1]
Battenberg, Eric [1]
Xiao, Ying [1]
Wang, Yuxuan [1]
Stanton, Daisy [1]
Shor, Joel [1]
Weiss, Ron J. [1]
Clark, Rob [1]
Saurous, Rif A. [1]
Affiliation
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Pages: 10