Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

Cited by: 1
Authors
Peng, Yukun [1]
Ling, Zhenhua [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei, Peoples R China
Source
Funding
National Key Research and Development Program of China;
Keywords
text-to-speech; speech synthesis; multilingual; meta-learning;
DOI
10.21437/Interspeech.2022-831
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder with a parameter generator conditioned on language embeddings and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method uses a two-stream model structure that contains two encoders and two decoders for pronunciation and prosody modeling, respectively, considering that the pronunciation knowledge and the prosody knowledge should be shared in different ways among languages. In our experiments, our proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method.
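The idea of a text encoder whose parameters come from a generator conditioned on language embeddings can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no layer types or dimensions, so the class `ParameterGenerator`, the function `encode`, the single linear layer, the sizes, and the language codes below are all hypothetical and chosen only to show the mechanism. The two-stream pronunciation and prosody structure is not reproduced here.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (illustrative init)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ParameterGenerator:
    """Hypothetical generator: maps a language embedding to the weights of
    one linear encoder layer, so all languages share the generator itself."""

    def __init__(self, lang_dim, in_dim, out_dim, rng):
        self.in_dim = in_dim
        self.out_dim = out_dim
        # A single projection that emits all in_dim * out_dim layer weights.
        self.proj = make_matrix(in_dim * out_dim, lang_dim, rng)

    def generate(self, lang_embedding):
        flat = matvec(self.proj, lang_embedding)
        # Reshape the flat weight vector into an out_dim x in_dim matrix.
        return [flat[r * self.in_dim:(r + 1) * self.in_dim]
                for r in range(self.out_dim)]


def encode(token_vectors, layer_weights):
    """Apply the generated linear layer to every input token vector."""
    return [matvec(layer_weights, vec) for vec in token_vectors]


rng = random.Random(0)
LANG_DIM, IN_DIM, OUT_DIM = 4, 3, 2  # assumed toy sizes

# Assumed language embeddings; in a trained system these would be learned.
lang_embeddings = {
    "en": [1.0, 0.0, 0.0, 0.0],
    "zh": [0.0, 1.0, 0.0, 0.0],
}

generator = ParameterGenerator(LANG_DIM, IN_DIM, OUT_DIM, rng)
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two toy input token vectors

# The same tokens are encoded differently depending on the language.
encoded = {lang: encode(tokens, generator.generate(emb))
           for lang, emb in lang_embeddings.items()}
```

In this sketch the only language-specific quantity is the embedding; the generator's projection is shared, which is one way knowledge could be shared across languages while the encoder weights still differ per language.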
Pages: 4257-4261
Page count: 5
Related papers
50 in total
  • [1] Multilingual context-based pronunciation learning for Text-to-Speech
    Comini, Giulia
    Ribeiro, Manuel Sam
    Yang, Fan
    Shim, Heereen
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2023, 2023, : 631 - 635
  • [2] Diction based prosody modeling in table-to-speech synthesis
    Spiliotopoulos, D
    Xydas, G
    Kouroupetroglou, G
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2005, 3658 : 294 - 301
  • [3] Modeling pronunciation variation for spontaneous speech synthesis
    Werner, S
    Wolff, M
    Eichner, M
    Hoffmann, R
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 673 - 676
  • [4] Prosody analysis and modeling for emotional speech synthesis
    Jiang, DN
    Zhang, W
    Shen, LQ
    Cai, LH
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 281 - 284
  • [5] CYBORG SPEECH: DEEP MULTILINGUAL SPEECH SYNTHESIS FOR GENERATING SEGMENTAL FOREIGN ACCENT WITH NATURAL PROSODY
    Henter, Gustav Eje
    Lorenzo-Trueba, Jaime
    Wang, Xin
    Kondo, Mariko
    Yamagishi, Junichi
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4799 - 4803
  • [6] Meta-learning-based approach for IoT data analytics
    Sairam Utukuru
    P Radha Krishna
    Sādhanā, 50 (2)
  • [7] A Meta-Learning-Based Train Dynamic Modeling Method for Accurately Predicting Speed and Position
    Cao, Ying
    Wang, Xi
    Zhu, Li
    Wang, Hongwei
    Wang, Xiaoning
    SUSTAINABILITY, 2023, 15 (11)
  • [8] Meta-Learning-Based Deep Reinforcement Learning for Multiobjective Optimization Problems
    Zhang, Zizhen
    Wu, Zhiyuan
    Zhang, Hang
    Wang, Jiahai
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (10) : 7978 - 7991
  • [9] Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations
    Liu, Chang
    Ling, Zhen-Hua
    Chen, Ling-Hui
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3706 - 3716
  • [10] Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling
    G. Bouselmi
    D. Fohr
    I. Illina
    International Journal of Speech Technology, 2012, 15 (2) : 203 - 213