Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Cited by: 2
Authors:
Ellinas, Nikolaos [1 ,3 ]
Christidou, Myrsini [1 ]
Vioni, Alexandra [1 ]
Sung, June Sig [2 ]
Chalamandaris, Aimilios [1 ]
Tsiakoulis, Pirros [1 ]
Mastorocostas, Paris [3 ]
Affiliations:
[1] Samsung Electronics, Innoetics, Athens, Greece
[2] Samsung Electronics, Mobile Communications Business, Suwon, South Korea
[3] University of West Attica, Department of Informatics & Computer Engineering, Athens, Greece
Keywords:
Controllable text-to-speech synthesis; Fine-grained control; Prosody control; Speaker adaptation
DOI: 10.1016/j.specom.2022.11.006
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract:
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration, and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario, and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range, despite the variability that a multispeaker setting introduces.
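To make the discretization step in the abstract concrete, below is a minimal Python sketch (not the authors' implementation) of how phoneme-level F0 values could be z-normalized per speaker and then clustered into a small set of shared, speaker-independent discrete prosodic labels with unsupervised K-means. The function name, the use of scikit-learn, and the number of clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def make_prosody_labels(f0_per_phoneme, speaker_ids, n_clusters=5, seed=0):
    """Map phoneme-level F0 values to discrete cluster labels.

    f0_per_phoneme: 1-D array of mean F0 (Hz) per phoneme, all speakers pooled.
    speaker_ids:    1-D array giving the speaker of each phoneme.
    Returns integer labels in [0, n_clusters), usable as a prosodic label sequence.
    """
    logf0 = np.log(f0_per_phoneme)
    # Speaker-independent clustering: z-normalize log-F0 within each speaker
    # so a single shared codebook covers every speaker's range.
    norm = np.empty_like(logf0)
    for spk in np.unique(speaker_ids):
        m = speaker_ids == spk
        norm[m] = (logf0[m] - logf0[m].mean()) / (logf0[m].std() + 1e-8)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(norm.reshape(-1, 1))
    # Reorder cluster indices so label 0 = lowest F0 and n_clusters-1 = highest,
    # which is what makes the discrete labels intuitive to control.
    order = np.argsort(km.cluster_centers_.ravel())
    remap = np.empty(n_clusters, dtype=int)
    remap[order] = np.arange(n_clusters)
    return remap[labels]

# Toy usage: two speakers with different F0 registers share one label set.
rng = np.random.default_rng(0)
f0 = np.concatenate([rng.normal(120, 20, 100), rng.normal(220, 30, 100)])
spk = np.array([0] * 100 + [1] * 100)
print(make_prosody_labels(f0, spk)[:10])

At inference, shifting a phoneme's label up or down would then correspond to raising or lowering its F0 within that speaker's own range; the paper applies the analogous idea to duration using balanced clustering.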
Pages: 22–31
Page count: 10