PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

Cited by: 4
Authors
Liu, Guanghou [1]
Zhang, Yongmao [1]
Lei, Yi [1]
Chen, Yunlin [2]
Wang, Rui [2]
Li, Zhifei [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Audio Speech & Language Proc Grp ASLP NPU, Sch Comp Sci, Xian, Peoples R China
[2] Shanghai Mobvoi Informat Technol Co Ltd, Shanghai, Peoples R China
Source
Keywords
text-to-speech; style transfer; style prompt;
DOI
10.21437/Interspeech.2023-1779
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may want to transfer style by typing a text description of the desired style, without reference speech in the target style. Text-guided content generation techniques have recently drawn wide attention. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representation through a two-stage training process. Experiments show that PromptStyle can achieve proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.
Pages: 4888-4892
Number of pages: 5
Related papers
50 records in total
  • [1] Natural language processing in a Japanese text-to-speech system for written-style texts
    Matsuoka, K
    Takeishi, E
    Asano, H
    THIRD IEEE WORKSHOP ON INTERACTIVE VOICE TECHNOLOGY FOR TELECOMMUNICATIONS APPLICATIONS - IVTTA-96, PROCEEDINGS, 1996, : 33 - 36
  • [2] Incorporating Cross-speaker Style Transfer for Multi-language Text-to-Speech
    Shang, Zengqiang
    Huang, Zhihua
    Zhang, Haozhe
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2021, 2021, : 1619 - 1623
  • [3] Controlling Emotion in Text-to-Speech with Natural Language Prompts
    Bott, Thomas
    Lux, Florian
    Vu, Ngoc Thang
    INTERSPEECH 2024, 2024, : 1795 - 1799
  • [4] ON-THE-FLY DATA AUGMENTATION FOR TEXT-TO-SPEECH STYLE TRANSFER
    Chung, Raymond
    Mak, Brian
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 634 - 641
  • [5] Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge
    Guan, Wenhao
    Li, Tao
    Li, Yishuang
    Huang, Hukai
    Hong, Qingyang
    Li, Lin
    INTERSPEECH 2023, 2023, : 4304 - 4308
  • [6] Text-to-speech for Slovak language
    Caky, P
    Klimo, M
    Mihálik, I
    Mladsik, R
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2004, 3206 : 291 - 298
  • [7] An overview of natural language processing techniques in text-to-speech systems
    Külekci, MO
    Oflazer, K
    PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 454 - 457
  • [8] CROSS-LINGUAL TEXT-TO-SPEECH VIA HIERARCHICAL STYLE TRANSFER
    Lee, Sang-Hoon
    Choi, Ha-Yeong
    Lee, Seong-Whan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 25 - 26
  • [9] Expressive Text-to-Speech using Style Tag
    Kim, Minchan
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Jong Jin
    Kim, Nam Soo
    INTERSPEECH 2021, 2021, : 4663 - 4667
  • [10] PROMPTTTS++: CONTROLLING SPEAKER IDENTITY IN PROMPT-BASED TEXT-TO-SPEECH USING NATURAL LANGUAGE DESCRIPTIONS
    Shimizu, Reo
    Yamamoto, Ryuichi
    Kawamura, Masaya
    Shirahata, Yuma
    Doi, Hironori
    Komatsu, Tatsuya
    Tachibana, Kentaro
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024, : 12672 - 12676