Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Cited by: 0

Authors:

Skerry-Ryan, R. J. [1]

Battenberg, Eric [1]

Xiao, Ying [1]

Wang, Yuxuan [1]

Stanton, Daisy [1]

Shor, Joel [1]

Weiss, Ron J. [1]

Clark, Rob [1]

Saurous, Rif A. [1]

Affiliation:

[1] Google Inc, Mountain View, CA 94043 USA

Source:

International Conference on Machine Learning, Vol. 80, 2018

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

Pages: 10
