VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Cited by: 0

Authors:

Gudmalwar, Ashishkumar [1]

Shah, Nirmesh [1]

Akarsh, Sai [1]

Wasnik, Pankaj [1]

Shah, Rajiv Ratn [2]

Affiliations:

[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India

[2] Indraprastha Inst Informat Technol IIIT, Delhi, India

Source:

INTERSPEECH 2024 | 2024

Keywords:

Cross-lingual TTS; emotion; voice cloning

DOI:

10.21437/Interspeech.2024-1672

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).

Pages: 3000-3004

Page count: 5
