VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Cited: 0
Authors
Gudmalwar, Ashishkumar [1 ]
Shah, Nirmesh [1 ]
Akarsh, Sai [1 ]
Wasnik, Pankaj [1 ]
Shah, Rajiv Ratn [2 ]
Affiliations
[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India
[2] Indraprastha Inst Informat Technol IIIT, Delhi, India
Source
Keywords
Cross-lingual TTS; emotion; voice cloning
DOI
10.21437/Interspeech.2024-1672
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
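The abstract mentions content and style consistency losses used to improve synthesized speech. A minimal sketch of how such losses are commonly formulated is shown below, assuming embedding-space cosine similarity between reference and synthesized utterances; the function names and the cosine formulation are illustrative assumptions, not the paper's exact implementation.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def style_consistency_loss(ref_style, syn_style):
    # Penalize mismatch between the emotional-style embedding of the
    # reference speech and that of the synthesized speech (hypothetical form).
    return 1.0 - cosine_sim(ref_style, syn_style)

def content_consistency_loss(ref_content, syn_content):
    # Same idea applied to content (linguistic) embeddings, so the
    # synthesized target-language speech preserves the intended content.
    return 1.0 - cosine_sim(ref_content, syn_content)

# Toy example with 3-dimensional embeddings:
identical = style_consistency_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])  # → 0.0
orthogonal = style_consistency_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # → 1.0
```

In a real training loop these terms would be weighted and added to the base TTS objective; the actual embedding networks and loss weighting in VECL-TTS are described in the paper itself.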
Pages: 3000-3004
Number of pages: 5
Related Papers
50 records in total
  • [31] Exploring Cross-lingual Singing Voice Synthesis Using Speech Data
    Cao, Yuewen
    Liu, Songxiang
    Kang, Shiyin
    Hu, Na
    Liu, Peng
    Liu, Xunying
    Su, Dan
    Yu, Dong
    Meng, Helen
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [32] A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Minchan
    Lee, Hyeonseung
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 55 - 59
  • [33] TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech
    Seong, Donghyun
    Lee, Hoyoung
    Chang, Joon-Hyuk
    INTERSPEECH 2024, 2024, : 1780 - 1784
  • [34] LNACont: Language-normalized Affine Coupling Layer with contrastive learning for Cross-lingual Multi-speaker Text-to-speech
    Hwang, Sungwoong
    Kim, Changhwan
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 391 - 395
  • [35] Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness
    Osman, Mohamed
    5TH INTERNATIONAL CONFERENCE ON COMPUTING AND INFORMATICS (ICCI 2022), 2022, : 169 - 174
  • [36] A SPECTRAL SPACE WARPING APPROACH TO CROSS-LINGUAL VOICE TRANSFORMATION IN HMM-BASED TTS
    Wang, Hao
    Soong, Frank
    Meng, Helen
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4874 - 4878
  • [37] STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework
    Tran, Chung
    Luong, Chi Mai
    Sakti, Sakriani
    INTERSPEECH 2023, 2023, : 4464 - 4468
  • [38] Cross-Lingual Voice Conversion-Based Polyglot Speech Synthesizer for Indian Languages
    Ramani, B.
    Jeeva, Actlin M. P.
    Vijayalakshmi, P.
    Nagarajan, T.
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 775 - 779
  • [39] FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis
    Zhou, Xun
    Zhou, Zhiyang
    Shi, Xiaodong
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [40] Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
    Shin, Yookyung
    Lee, Younggun
    Jo, Suhee
    Hwang, Yeongtae
    Kim, Taesu
    INTERSPEECH 2022, 2022, : 2313 - 2317