Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Cited by: 2

Authors:

Ellinas, Nikolaos [1,3]

Christidou, Myrsini [1]

Vioni, Alexandra [1]

Sung, June Sig [2]

Chalamandaris, Aimilios [1]

Tsiakoulis, Pirros [1]

Mastorocostas, Paris [3]

Affiliations:

[1] Samsung Elect, Innoet, Athens, Greece

[2] Samsung Elect, Mobile Commun Business, Suwon, South Korea

[3] Univ West Attica, Dept Informat & Comp Engn, Athens, Greece

Source:

SPEECH COMMUNICATION | 2023 / Vol. 146

Keywords:

Controllable text-to-speech synthesis; Fine-grained control; Prosody control; Speaker adaptation;

DOI:

10.1016/j.specom.2022.11.006

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
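
The discretization step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, cluster counts, or normalization the authors use, so everything below is an assumption for illustration only: per-speaker z-scoring stands in for "F0 normalization", a plain 1-D k-means stands in for the unsupervised clustering, and equal-frequency (quantile) binning stands in for "balanced clustering for duration". The function names, the choice of 5 labels, and the synthetic data are all hypothetical.

```python
import numpy as np

def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's mean/std
    (an assumed form of F0 normalization, not taken from the paper)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out

def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means. Returns centroids sorted ascending, so that
    label 0 is the lowest cluster and label k-1 the highest."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation at evenly spaced quantiles.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = values[labels == j].mean()
    return np.sort(centroids)

def assign_to_centroids(values, centroids):
    """Map each value to the index of its nearest centroid."""
    values = np.asarray(values, dtype=float)
    return np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)

def balanced_edges(values, k):
    """Interior quantile edges, so each of the k labels covers ~1/k of the data."""
    return np.quantile(np.asarray(values, dtype=float), np.arange(1, k) / k)

def assign_to_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return np.searchsorted(edges, values, side="right")

# Synthetic two-speaker data (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
speakers = np.array([0] * 200 + [1] * 200)
f0_hz = np.concatenate([rng.normal(120, 15, 200), rng.normal(210, 25, 200)])
durations_s = rng.gamma(2.0, 0.04, 400)

K = 5  # assumed number of discrete labels per feature
f0_norm = normalize_f0_per_speaker(f0_hz, speakers)
f0_centroids = kmeans_1d(f0_norm, K)
f0_labels = assign_to_centroids(f0_norm, f0_centroids)

dur_edges = balanced_edges(durations_s, K)
dur_labels = assign_to_bins(durations_s, dur_edges)
```

In this sketch the normalized F0 values of all speakers are pooled before clustering, which is one simple way to obtain labels shared across speakers; the resulting integer sequences `f0_labels` and `dur_labels` play the role of the discrete prosodic labels that the abstract says are fed to the prosody encoder.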

Pages: 22-31

Page count: 10

50 items in total

[41] Phone-Level Prosody Modelling With GMM-Based MDN for Diverse and Controllable Speech Synthesis
Du, Chenpeng
Yu, Kai
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 190 - 201
[42] IMPROVING NATURALNESS AND CONTROLLABILITY OF SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS BY LEARNING LOCAL PROSODY REPRESENTATIONS
Gong, Cheng
Wang, Longbiao
Ling, Zhenhua
Guo, Shaotong
Zhang, Ju
Dang, Jianwu
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5724 - 5728
[43] Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
Ribeiro, Manuel Sam
Watts, Oliver
Yamagishi, Junichi
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 3186 - 3190
[44] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
Tu, Tao
Chen, Yuan-Jui
Liu, Alexander H.
Lee, Hung-yi
INTERSPEECH 2020, 2020, : 3191 - 3195
[45] LEARNING UTTERANCE-LEVEL REPRESENTATIONS FOR SPEECH EMOTION AND AGE/GENDER RECOGNITION USING DEEP NEURAL NETWORKS
Wang, Zhong-Qiu
Tashev, Ivan
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5150 - 5154
[46] Improve emotional speech synthesis quality by learning explicit and implicit representations with semi-supervised training
He, Jiaxu
Gong, Cheng
Wang, Longbiao
Jin, Di
Wang, Xiaobao
Xu, Junhai
Dang, Jianwu
INTERSPEECH 2022, 2022, : 5538 - 5542
[47] Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis
Liu, Zhaoci
Wu, Ningqian
Zhang, Yajie
Ling, Zhenhua
INTERSPEECH 2022, 2022, : 5508 - 5512
[48] Deep Learning Speech Synthesis Model for Word/Character-Level Recognition in the Tamil Language
Rajendran, Sukumar
Raja, Kiruba Thangam
Nagarajan, G.
Dass, A. Stephen
Kumar, M. Sandeep
Jayagopal, Prabhu
INTERNATIONAL JOURNAL OF E-COLLABORATION, 2023, 19 (04) : 20 - 20
[49] Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages
Madasamy, Anand Kumar
Padannayil, Soman Kutti
JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING, 2021, 14 (6) : 7207 - 7218
[50] Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages
Anand Kumar Madasamy
Soman Kutti Padannayil
Journal of Ambient Intelligence and Humanized Computing, 2023, 14 : 7207 - 7218
