Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Citations: 0
Authors
Skerry-Ryan, R. J. [1]
Battenberg, Eric [1]
Xiao, Ying [1]
Wang, Yuxuan [1]
Stanton, Daisy [1]
Shor, Joel [1]
Weiss, Ron J. [1]
Clark, Rob [1]
Saurous, Rif A. [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Pages: 10
Related papers
50 in total
  • [1] Tacotron: Towards End-to-End Speech Synthesis
    Wang, Yuxuan
    Skerry-Ryan, R. J.
    Stanton, Daisy
    Wu, Yonghui
    Weiss, Ron J.
    Jaitly, Navdeep
    Yang, Zongheng
    Xiao, Ying
    Chen, Zhifeng
    Bengio, Samy
    Le, Quoc
    Agiomyrgiannakis, Yannis
    Clark, Rob
    Saurous, Rif A.
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 4006 - 4010
  • [2] Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
    Dai, Xudong
    Gong, Cheng
    Wang, Longbiao
    Zhang, Kaili
    INTERSPEECH 2021, 2021, : 131 - 135
  • [3] CE-Tacotron2: End-to-End Emotional Speech Synthesis
    Wang, Zhi
    Liu, Yinhua
    Shan, Liang
    2021 60TH ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS OF JAPAN (SICE), 2021, : 48 - 52
  • [4] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [5] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Giridhar Pamisetty
    K. Sri Rama Murty
    Circuits, Systems, and Signal Processing, 2023, 42 : 361 - 384
  • [6] Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Jiang, Mingqi
    Xie, Lei
    INTERSPEECH 2022, 2022, : 5498 - 5502
  • [7] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [8] Towards end-to-end speech recognition with transfer learning
    Qin, Chu-Xiong
    Qu, Dan
    Zhang, Lian-Hai
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2018,
  • [9] Towards end-to-end speech recognition with transfer learning
    Chu-Xiong Qin
    Dan Qu
    Lian-Hai Zhang
    EURASIP Journal on Audio, Speech, and Music Processing, 2018
  • [10] MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer
    Moon, Sungwoo
    Kim, Sunghyun
    Choi, Yong-Hoon
    IEEE ACCESS, 2022, 10 : 25455 - 25463