VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Cited: 0
Authors
Gudmalwar, Ashishkumar [1 ]
Shah, Nirmesh [1 ]
Akarsh, Sai [1 ]
Wasnik, Pankaj [1 ]
Shah, Rajiv Ratn [2 ]
Affiliations
[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India
[2] Indraprastha Inst Informat Technol IIIT, Delhi, India
Source
Keywords
Cross-lingual TTS; emotion; voice cloning
DOI
10.21437/Interspeech.2024-1672
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
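The abstract mentions content and style consistency losses used to improve synthesized speech. A minimal sketch of how such losses are commonly formulated is shown below, assuming embedding-space cosine similarity between reference and synthesized utterances; the function names and the cosine formulation are illustrative assumptions, not the paper's exact implementation.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def style_consistency_loss(ref_style, syn_style):
    # Penalize mismatch between the emotional-style embedding of the
    # reference speech and that of the synthesized speech (hypothetical form).
    return 1.0 - cosine_sim(ref_style, syn_style)

def content_consistency_loss(ref_content, syn_content):
    # Same idea applied to content (linguistic) embeddings, so the
    # synthesized target-language speech preserves the intended content.
    return 1.0 - cosine_sim(ref_content, syn_content)

# Toy example with 3-dimensional embeddings:
identical = style_consistency_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # → 0.0
orthogonal = style_consistency_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # → 1.0
```

In a real training loop these terms would be weighted and added to the base TTS objective; the actual embedding networks and loss weighting in VECL-TTS are described in the paper itself.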
Pages: 3000-3004
Number of pages: 5
Related Papers
50 records in total
  • [31] Exploring Cross-lingual Singing Voice Synthesis Using Speech Data
    Cao, Yuewen
    Liu, Songxiang
    Kang, Shiyin
    Hu, Na
    Liu, Peng
    Liu, Xunying
    Su, Dan
    Yu, Dong
    Meng, Helen
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [32] A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Minchan
    Lee, Hyeonseung
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 55 - 59
  • [33] TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
    Seong, Donghyun
    Lee, Hoyoung
    Chang, Joon-Hyuk
    INTERSPEECH 2024, 2024, : 1780 - 1784
  • [34] LNACont: Language-normalized Affine Coupling Layer with contrastive learning for Cross-lingual Multi-speaker Text-to-speech
    Hwang, Sungwoong
    Kim, Changhwan
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 391 - 395
  • [35] Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness
    Osman, Mohamed
    5TH INTERNATIONAL CONFERENCE ON COMPUTING AND INFORMATICS (ICCI 2022), 2022, : 169 - 174
  • [36] A SPECTRAL SPACE WARPING APPROACH TO CROSS-LINGUAL VOICE TRANSFORMATION IN HMM-BASED TTS
    Wang, Hao
    Soong, Frank
    Meng, Helen
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4874 - 4878
  • [37] STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework
    Tran, Chung
    Luong, Chi Mai
    Sakti, Sakriani
    INTERSPEECH 2023, 2023, : 4464 - 4468
  • [38] Cross-Lingual Voice Conversion-Based Polyglot Speech Synthesizer for Indian Languages
    Ramani, B.
    Jeeva, Actlin M. P.
    Vijayalakshmi, P.
    Nagarajan, T.
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 775 - 779
  • [39] FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis
    Zhou, Xun
    Zhou, Zhiyang
    Shi, Xiaodong
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [40] Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
    Shin, Yookyung
    Lee, Younggun
    Jo, Suhee
    Hwang, Yeongtae
    Kim, Taesu
    INTERSPEECH 2022, 2022, : 2313 - 2317