CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Cited by: 39
Authors
Karlapati, Sri [1]
Moinet, Alexis [1]
Joly, Arnaud [1]
Klimkov, Viacheslav [1]
Sáez-Trigueros, Daniel [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
Source
INTERSPEECH 2020
Keywords
Neural text-to-speech; fine-grained prosody transfer; many-to-many prosody transfer
DOI
10.21437/Interspeech.2020-1251
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of 47% in the quality of prosody transfer and 14% in preserving the target speaker identity, while still maintaining the same naturalness.
Pages: 4387-4391
Page count: 5