Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Cited by: 2

Authors:

Ellinas, Nikolaos [1,3]

Christidou, Myrsini [1]

Vioni, Alexandra [1]

Sung, June Sig [2]

Chalamandaris, Aimilios [1]

Tsiakoulis, Pirros [1]

Mastorocostas, Paris [3]

Affiliations:

[1] Samsung Elect, Innoet, Athens, Greece

[2] Samsung Elect, Mobile Commun Business, Suwon, South Korea

[3] Univ West Attica, Dept Informat & Comp Engn, Athens, Greece

Source:

SPEECH COMMUNICATION | 2023 / Vol. 146

Keywords:

Controllable text-to-speech synthesis; Fine-grained control; Prosody control; Speaker adaptation;

DOI:

10.1016/j.specom.2022.11.006

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario, and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
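
Illustrative sketch (not from the paper): the abstract names F0 normalization and speaker-independent clustering but does not state the clustering algorithm or the number of labels. The snippet below assumes per-speaker z-score normalization followed by plain 1-D k-means over the pooled values; the function names, the choice of three clusters, and the toy F0 values are all hypothetical. Balanced clustering for duration is not shown.

```python
from statistics import mean, pstdev


def normalize_per_speaker(f0_by_speaker):
    """Z-score each speaker's phoneme-level F0 values with that speaker's own statistics."""
    normalized = {}
    for speaker, values in f0_by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against a constant sequence
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized


def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means (an assumed stand-in); centroids are returned in ascending order."""
    ordered = sorted(values)
    # Deterministic start: pick centroids at evenly spaced quantiles.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        updated = [mean(b) if b else centroids[j] for j, b in enumerate(buckets)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def to_labels(values, centroids):
    """Map each value to the index of its nearest centroid (0 = lowest cluster)."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j])) for v in values]


# Toy phoneme-level F0 values in Hz (made up); spk_b sits an octave above spk_a.
f0_by_speaker = {
    "spk_a": [90.0, 95.0, 120.0, 125.0, 150.0, 155.0],
    "spk_b": [180.0, 190.0, 240.0, 250.0, 300.0, 310.0],
}
normalized = normalize_per_speaker(f0_by_speaker)
pooled = [v for values in normalized.values() for v in values]
centroids = kmeans_1d(pooled, k=3)
labels = {spk: to_labels(vals, centroids) for spk, vals in normalized.items()}
```

Because the clusters are fit on the pooled, normalized values, both toy speakers receive the same label sequence even though their raw pitch ranges differ, which is the intuition behind using one shared label set across speakers.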


Pages: 22-31

Page count: 10
