Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Cited by: 2
Authors
Ellinas, Nikolaos [1, 3]
Christidou, Myrsini [1 ]
Vioni, Alexandra [1 ]
Sung, June Sig [2 ]
Chalamandaris, Aimilios [1 ]
Tsiakoulis, Pirros [1 ]
Mastorocostas, Paris [3 ]
Affiliations
[1] Samsung Elect, Innoet, Athens, Greece
[2] Samsung Elect, Mobile Commun Business, Suwon, South Korea
[3] Univ West Attica, Dept Informat & Comp Engn, Athens, Greece
Keywords
Controllable text-to-speech synthesis; Fine-grained control; Prosody control; Speaker adaptation;
DOI
10.1016/j.specom.2022.11.006
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
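The discretization step described in the abstract can be illustrated with a small sketch. The abstract does not state which clustering algorithm, normalization, or number of labels the authors use, so everything below is an assumption made for illustration: per-speaker z-scoring stands in for "F0 normalization", a plain 1-D k-means stands in for the unsupervised clustering, quantile binning stands in for "balanced clustering for duration", and the function names, the choice of 5 labels, and the synthetic data are all hypothetical.

```python
import numpy as np


def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's mean and std
    (an assumed form of F0 normalization, not confirmed by the abstract)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out


def kmeans_1d_labels(values, n_clusters, n_iter=50):
    """Plain 1-D k-means. Centroids are sorted so label 0 is the lowest cluster."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation at evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    centroids.sort()
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return labels, centroids


def balanced_duration_labels(durations, n_clusters):
    """Quantile binning: each label covers roughly the same number of phonemes
    (an assumed form of balanced clustering)."""
    durations = np.asarray(durations, dtype=float)
    edges = np.quantile(durations, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.searchsorted(edges, durations, side="right")


# Synthetic phoneme-level features for two hypothetical speakers.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(["spk_a", "spk_b"], 500)
f0_hz = np.concatenate([rng.normal(120, 20, 500), rng.normal(220, 35, 500)])
durations_s = rng.gamma(shape=2.0, scale=0.04, size=1000)

# Normalizing first lets one shared (speaker-independent) set of F0 clusters
# be applied to speakers with very different pitch ranges.
f0_norm = normalize_f0_per_speaker(f0_hz, speaker_ids)
f0_labels, f0_centroids = kmeans_1d_labels(f0_norm, n_clusters=5)
dur_labels = balanced_duration_labels(durations_s, n_clusters=5)

print("F0 label counts:      ", np.bincount(f0_labels, minlength=5))
print("Duration label counts:", np.bincount(dur_labels, minlength=5))
```

In a setup like this, the integer labels per phoneme would be the discrete sequence passed to a prosody encoder, and changing a label at inference time would be the control handle. Whether the paper's labels are ordered or produced this way is not stated in the abstract.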
Pages: 22-31
Page count: 10