Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Cited by: 0
Authors
Jiang, Chenglong [1]
Gao, Ying [1]
Ng, Wing W. Y. [1]
Zhou, Jiyong [1]
Zhong, Jinghui [1]
Zhen, Hongzhong [1]
Hu, Xiping [2]
Affiliations
[1] South China Univ Technol, Guangzhou 511442, Peoples R China
[2] Shenzhen MSU BIT Univ, Shenzhen 518172, Peoples R China
Keywords
Semantic dependency; Local convolution; Tone; Naturalness; Text-to-speech synthesis
DOI
10.1016/j.neucom.2024.128430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short in capturing local dependencies, particularly on datasets with strong local correlations. To address this challenge, we propose a novel method that uses semantic dependency to extract linguistic information from the original text. The semantic relationships between nodes serve as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pauses, stress, and intonation in speech.
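The two ideas in the abstract (convolution-generated query and value matrices, and a semantic-dependency prior on the attention distribution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the prior is combined with the attention scores, so the additive bias on the logits, the `prior_weight` parameter, the linear key projection, and all function and variable names here are assumptions made only for illustration.

```python
import numpy as np


def conv1d_same(x, kernel):
    """1D convolution over the time axis with zero 'same' padding.

    x: (T, d_in); kernel: (k, d_in, d_out) with k odd. Returns (T, d_out).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]  # the k neighbouring positions around t
        out[t] = np.einsum("ki,kio->o", window, kernel)
    return out


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def dependency_conv_attention(x, dep_adj, w_q, w_k, w_v, prior_weight=1.0):
    """Single-head attention with conv-generated Q and V (hypothetical sketch).

    dep_adj: (T, T) matrix, 1.0 where two tokens are linked in the
    semantic dependency graph. ASSUMPTION: the prior is added to the logits.
    """
    q = conv1d_same(x, w_q)  # query from local context
    k = x @ w_k              # key from a plain linear projection (assumed)
    v = conv1d_same(x, w_v)  # value from local context
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + prior_weight * dep_adj
    attn = softmax(logits, axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
dep_adj = np.eye(T)
dep_adj[0, 1] = dep_adj[1, 0] = 1.0  # toy dependency edge between tokens 0 and 1
w_q = rng.normal(size=(3, d, d)) * 0.1
w_k = rng.normal(size=(d, d)) * 0.1
w_v = rng.normal(size=(3, d, d)) * 0.1
out, attn = dependency_conv_attention(x, dep_adj, w_q, w_k, w_v)
```

In this sketch, raising `prior_weight` shifts attention mass toward token pairs that are linked in the dependency graph, while the kernel width of 3 lets each query and value vector see its immediate neighbours.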
Pages: 11