Towards Multi-Scale Style Control for Expressive Speech Synthesis

Cited by: 9
Authors
Li, Xiang [1]
Song, Changhe [1]
Li, Jingbei [1]
Wu, Zhiyong [1, 2]
Jia, Jia [1, 3]
Meng, Helen [1, 2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
text-to-speech; expressive speech synthesis; prosody; multi-scale; speech style;
DOI
10.21437/Interspeech.2021-947
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
This paper introduces a multi-scale speech style modeling method for end-to-end expressive speech synthesis. The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech, which are then fed into the speech synthesis model as an extension to the input phoneme sequence. During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion. Experimental results on a style transfer task indicate that the proposed method greatly improves both the controllability of the multi-scale speech style model and the expressiveness of the synthesized speech. Moreover, assigning a different reference speech to style extraction at each scale further demonstrates the flexibility of the proposed method.
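The data flow described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the pooling functions stand in for the learned reference encoder, the fixed `chunk` size stands in for quasi-phoneme segmentation, and the proportional index mapping stands in for whatever learned alignment the real system uses. All function and variable names here are hypothetical.

```python
from statistics import fmean

def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    return [fmean(col) for col in zip(*frames)]

def multi_scale_style(ref_frames, chunk=4):
    """Toy stand-in for a multi-scale reference encoder (assumption, not the paper's network).

    Returns (global_style, local_styles): one utterance-level vector, plus one
    vector per block of `chunk` frames as a rough quasi-phoneme-level sequence.
    """
    global_style = mean_pool(ref_frames)
    local_styles = [mean_pool(ref_frames[i:i + chunk])
                    for i in range(0, len(ref_frames), chunk)]
    return global_style, local_styles

def extend_phonemes(phoneme_emb, global_style, local_styles):
    """Append global and local style features to each phoneme embedding.

    Local styles are mapped to phonemes by proportional index, an assumed
    placeholder for a learned alignment.
    """
    n, m = len(phoneme_emb), len(local_styles)
    extended = []
    for i, emb in enumerate(phoneme_emb):
        local = local_styles[min(i * m // n, m - 1)]
        extended.append(list(emb) + list(global_style) + list(local))
    return extended

# Two different references: one supplies global style, the other local style.
ref_a = [[1.0, 2.0]] * 8                      # 8 frames of 2-dim features
ref_b = [[float(t), 0.0] for t in range(8)]   # 8 frames with a rising first dim
phonemes = [[0.1], [0.2], [0.3], [0.4]]       # 4 phonemes, 1-dim embeddings

g, _ = multi_scale_style(ref_a)
_, l = multi_scale_style(ref_b, chunk=4)
extended = extend_phonemes(phonemes, g, l)
```

Drawing `g` and `l` from different references mirrors the last sentence of the abstract, where each scale takes its style from a separate reference speech.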
Pages: 4673 - 4677
Page count: 5
Related papers
50 in total
  • [1] MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis
    Lei, Shun
    Zhou, Yixuan
    Chen, Liyang
    Wu, Zhiyong
    Wu, Xixin
    Kang, Shiyin
    Meng, Helen
    [J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2023, 31 : 3290 - 3303
  • [2] Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
    Lei, Shun
    Zhou, Yixuan
    Chen, Liyang
    Hu, Jiankun
    Wu, Zhiyong
    Kang, Shiyin
    Meng, Helen
    [J]. INTERSPEECH 2022, 2022, : 5523 - 5527
  • [3] Towards Expressive Speech Synthesis: Analysis and Modeling of Expressive Speech
    Raptis, Spyros
    Karabetsos, Sotiris
    Chalamandaris, Aimilios
    Tsiakoulis, Pirros
    [J]. 2014 5th IEEE Conference on Cognitive Infocommunications (CogInfoCom), 2014, : 461 - 465
  • [4] A style control technique for HMM-based expressive speech synthesis
    Nose, Takashi
    Yamagishi, Junichi
    Masuko, Takashi
    Kobayashi, Takao
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2007, E90D (09) : 1406 - 1413
  • [5] MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis
    Lei, Yi
    Yang, Shan
    Wang, Xinsheng
    Xie, Lei
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 853 - 864
  • [6] TOWARDS EXPRESSIVE SPEAKING STYLE MODELLING WITH HIERARCHICAL CONTEXT INFORMATION FOR MANDARIN SPEECH SYNTHESIS
    Lei, Shun
    Zhou, Yixuan
    Chen, Liyang
    Wu, Zhiyong
    Kang, Shiyin
    Meng, Helen
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7922 - 7926
  • [7] M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing
    Liu, Yan
    Wei, Li-Fang
    Qian, Xinyuan
    Zhang, Tian-Hao
    Chen, Song-Lu
    Yin, Xu-Cheng
    [J]. PATTERN RECOGNITION LETTERS, 2024, 179 : 158 - 164
  • [8] INTERACTIVE MULTI-LEVEL PROSODY CONTROL FOR EXPRESSIVE SPEECH SYNTHESIS
    Cornille, Tobias
    Wang, Fengna
    Bekker, Jessa
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8312 - 8316
  • [9] Embroidery style transfer modeling based on multi-scale texture synthesis
    Yao, Linhan
    Zhang, Ying
    Yao, Lan
    Zheng, Xiaoping
    Wei, Wenda
    Liu, Chengxia
    [J]. Fangzhi Xuebao/Journal of Textile Research, 2023, 44 (09) : 84 - 90
  • [10] Towards Glottal Source Controllability in Expressive Speech Synthesis
    Lorenzo-Trueba, Jaime
    Barra-Chicote, Roberto
    Raitio, Tuomo
    Obin, Nicolas
    Alku, Paavo
    Yamagishi, Junichi
    Montero, Juan M.
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 1618 - 1621