Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Cited by: 0
Authors
Jiang, Chenglong [1]
Gao, Ying [1]
Ng, Wing W. Y. [1]
Zhou, Jiyong [1]
Zhong, Jinghui [1]
Zhen, Hongzhong [1]
Hu, Xiping [2]
Affiliations
[1] South China University of Technology, Guangzhou 511442, People's Republic of China
[2] Shenzhen MSU-BIT University, Shenzhen 518172, People's Republic of China
Keywords
Semantic dependency; Local convolution; Tone; Naturalness; Text-to-speech synthesis
DOI
10.1016/j.neucom.2024.128430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short in capturing local dependencies, particularly on datasets with strong local correlations. To address this challenge, we propose a novel method that uses semantic dependency to extract linguistic information from the original text. The semantic relationships between nodes serve as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the handling of pauses, stress, and intonation in speech.
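The two ideas in the abstract (queries and values produced by a local one-dimensional convolution, and a semantic-dependency prior that reshapes the attention distribution) can be illustrated with a small dependency-free Python sketch. This is not the authors' implementation. The function names, the shared convolution taps, the use of the raw inputs as keys, the additive form of the prior, and the `alpha` weight are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math

def conv1d_same(x, kernel):
    """Depthwise 1-D filter over time with zero padding (output length == input length).

    x: list of T vectors of size d; kernel: list of k scalar taps (k odd).
    Written in the deep-learning convention (cross-correlation).
    """
    T, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(T):
        row = [0.0] * d
        for j, w in enumerate(kernel):
            s = t + j - half
            if 0 <= s < T:  # positions outside the sequence count as zeros
                for c in range(d):
                    row[c] += w * x[s][c]
        out.append(row)
    return out

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def prior_biased_attention(x, kernel, prior, alpha=1.0):
    """Self-attention with convolutional Q and V and an additive prior on the scores.

    prior[i][j] is 1 if tokens i and j are linked in a (hypothetical) dependency
    graph and 0 otherwise; alpha scales how strongly the prior shifts attention.
    """
    T, d = len(x), len(x[0])
    q = conv1d_same(x, kernel)  # queries from local context
    v = conv1d_same(x, kernel)  # values from local context (taps shared for brevity)
    k = x                       # assumption: keys are the unprojected inputs
    weights, outputs = [], []
    for i in range(T):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) + alpha * prior[i][j]
            for j in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(d)])
    return outputs, weights

# Toy input: 4 tokens with 2 features each; the "dependency" prior links neighbours only.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
prior = [[1 if abs(i - j) <= 1 else 0 for j in range(4)] for i in range(4)]
out, w = prior_biased_attention(x, [0.25, 0.5, 0.25], prior, alpha=2.0)
```

Each row of `w` is a probability distribution over positions. Raising `alpha` moves attention mass toward the positions marked in `prior`, which is one simple way a dependency graph could act as prior knowledge on the attention weights.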
Pages: 11