VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Authors
Gudmalwar, Ashishkumar [1]
Shah, Nirmesh [1]
Akarsh, Sai [1]
Wasnik, Pankaj [1]
Shah, Rajiv Ratn [2]
Affiliations
[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India
[2] Indraprastha Inst Informat Technol IIIT, Delhi, India
Source
Interspeech 2024
Keywords
Cross-lingual TTS; emotion; voice cloning
DOI
10.21437/Interspeech.2024-1672
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite significant advances in Text-to-Speech (TTS) systems, their use in automatic dubbing remains limited. Automatic dubbing requires extracting voice identity and emotional style from reference speech in a source language and then transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to further enhance the quality of the synthesized speech. The proposed system achieved an average relative improvement of 8.83% over state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
Pages: 3000-3004
Page count: 5