Towards Multi-Scale Style Control for Expressive Speech Synthesis

Cited by: 9

Authors:

Li, Xiang [1]

Song, Changhe [1]

Li, Jingbei [1]

Wu, Zhiyong [1,2]

Jia, Jia [1,3]

Meng, Helen [1,2]

Affiliations:

[1] Tsinghua Univ, Shenzhen Int Grad Sch, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen, Peoples R China

[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China

[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China

Source:

INTERSPEECH 2021 | 2021

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China;

Keywords:

text-to-speech; expressive speech synthesis; prosody; multi-scale; speech style;

DOI:

10.21437/Interspeech.2021-947

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification codes:

100104 ; 100213 ;

Abstract:

This paper introduces a multi-scale speech style modeling method for end-to-end expressive speech synthesis. The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech, which are then fed into the speech synthesis model as an extension to the input phoneme sequence. During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion. When the proposed method is applied to the style transfer task, experimental results indicate that the controllability of the multi-scale speech style model and the expressiveness of the synthesized speech are greatly improved. Moreover, assigning a different reference utterance to style extraction at each scale further demonstrates the flexibility of the proposed method.
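The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: mean pooling stands in for the learned multi-scale reference encoder, the fixed `chunk` size stands in for quasi-phoneme-level segmentation, and the index-based alignment in `extend_phonemes` is a hypothetical placeholder for whatever alignment the real model learns. All function names and the example vectors are assumptions made for illustration.

```python
from statistics import fmean

def global_style(frames):
    """Global-scale style: one vector, the mean over all reference frames.
    (Assumption: mean pooling replaces the learned utterance-level encoder.)"""
    dims = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dims)]

def local_style(frames, chunk=2):
    """Local-scale style: one vector per group of `chunk` consecutive frames.
    (Assumption: fixed-size chunks replace quasi-phoneme-level units.)"""
    return [global_style(frames[i:i + chunk])
            for i in range(0, len(frames), chunk)]

def extend_phonemes(phoneme_embs, global_vec, local_seq):
    """Extend each phoneme embedding with the global vector and one local vector.
    (Assumption: naive proportional index mapping, not a learned alignment.)"""
    n = len(phoneme_embs)
    extended = []
    for i, emb in enumerate(phoneme_embs):
        j = i * len(local_seq) // n  # map phoneme index to a local-style index
        extended.append(emb + global_vec + local_seq[j])
    return extended

# Two hypothetical reference utterances (4 frames, 2 features each).
ref_a = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
ref_b = [[0.0, 1.0], [0.0, 3.0], [2.0, 5.0], [2.0, 7.0]]
# Hypothetical 1-dimensional phoneme embeddings for a 4-phoneme input.
phonemes = [[0.1], [0.2], [0.3], [0.4]]

# Global style is taken from ref_a and local style from ref_b.
extended = extend_phonemes(phonemes, global_style(ref_a), local_style(ref_b))
```

Taking the global vector from `ref_a` and the local sequence from `ref_b` mirrors the abstract's last point, that each scale can draw its style from a different reference utterance.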


Pages: 4673-4677

Page count: 5
