PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

Cited: 0
Authors
Shimizu, Reo [1 ,2 ]
Yamamoto, Ryuichi [2 ,3 ]
Kawamura, Masaya [2 ,3 ]
Shirahata, Yuma [2 ,3 ]
Doi, Hironori [2 ,3 ]
Komatsu, Tatsuya [2 ,3 ]
Tachibana, Kentaro [2 ,3 ]
Affiliations
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] LINE Corp, Tokyo, Japan
[3] LY Corp, Tokyo, Japan
Keywords
Text-to-speech; speech synthesis; speaker generation; mixture model; diffusion model;
DOI
10.1109/ICASSP48485.2024.10448173
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
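The abstract describes modeling diverse speaker factors with a mixture density network (MDN), i.e., predicting the parameters of a Gaussian mixture over speaker representations from a prompt embedding and then sampling from it. The sketch below illustrates that idea only; the dimensions, the single random linear projection, and the function names (`mdn_head`, `sample_speaker`) are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of an MDN head: map a speaker-prompt embedding to the
# parameters of a Gaussian mixture over speaker embeddings, then sample one.
# All sizes and the untrained linear projection are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

PROMPT_DIM, SPK_DIM, N_MIX = 16, 8, 4  # assumed dimensions

# Randomly initialized projection for illustration; a real model learns this.
W = rng.standard_normal((PROMPT_DIM, N_MIX * (1 + 2 * SPK_DIM)))

def mdn_head(prompt_emb: np.ndarray):
    """Return mixture weights, means, and std-devs for one prompt embedding."""
    out = prompt_emb @ W
    logits = out[:N_MIX]
    means = out[N_MIX:N_MIX + N_MIX * SPK_DIM].reshape(N_MIX, SPK_DIM)
    log_sigmas = out[N_MIX + N_MIX * SPK_DIM:].reshape(N_MIX, SPK_DIM)
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights, means, np.exp(log_sigmas)

def sample_speaker(prompt_emb: np.ndarray) -> np.ndarray:
    """Sample a speaker embedding: pick a mixture component, then draw a Gaussian."""
    w, mu, sigma = mdn_head(prompt_emb)
    k = rng.choice(N_MIX, p=w)
    return mu[k] + sigma[k] * rng.standard_normal(SPK_DIM)

prompt = rng.standard_normal(PROMPT_DIM)  # stands in for an encoded prompt
w, mu, sigma = mdn_head(prompt)
spk = sample_speaker(prompt)
```

Sampling from the mixture (rather than regressing a single mean) is what lets such a model generate distinct plausible voices for the same description.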
Pages: 12672-12676 (5 pages)