Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Authors
Jiang, Chenglong [1]
Gao, Ying [1]
Ng, Wing W. Y. [1]
Zhou, Jiyong [1]
Zhong, Jinghui [1]
Zhen, Hongzhong [1]
Hu, Xiping [2]
Affiliations
[1] South China University of Technology, Guangzhou 511442, People's Republic of China
[2] Shenzhen MSU-BIT University, Shenzhen 518172, People's Republic of China
Keywords
Semantic dependency; Local convolution; Tone; Naturalness; Text-to-speech synthesis
DOI
10.1016/j.neucom.2024.128430
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Self-attention-based networks have become increasingly popular because of their strong performance in parallel training and global context modeling. However, they may fall short in capturing local dependencies, particularly on datasets with strong local correlations. To address this challenge, we propose a novel method that uses semantic dependency to extract linguistic information from the original text. The semantic relationships between nodes serve as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the handling of pauses, stress, and intonation in speech.
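The abstract names two modifications to self-attention but gives no formulas, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the dependency prior enters as an additive bias on the attention scores (scaled by a hypothetical weight `alpha`), that keys come from a plain linear projection, and that queries and values come from a one-dimensional convolution over the token axis. All function and parameter names are illustrative.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution over the time axis with zero 'same' padding.
    x: (T, d_in); kernel: (k, d_in, d_out) with odd k. Returns (T, d_out)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]                      # local context around token t
        out[t] = np.einsum("ki,kio->o", window, kernel)
    return out

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dependency_biased_attention(x, w_q, w_k, w_v, dep_adj, alpha=1.0):
    """Single-head attention with convolutional Q/V and a dependency prior.
    dep_adj: (T, T) matrix, 1.0 where two tokens share a dependency edge.
    ASSUMPTION: the prior is added to the scores; the paper may differ."""
    q = conv1d_same(x, w_q)                       # queries from local convolution
    k = x @ w_k                                   # keys from a linear projection (assumed)
    v = conv1d_same(x, w_v)                       # values from local convolution
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + alpha * dep_adj             # prior nudges the distribution
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

# Toy usage: 5 tokens, 8 features, one dependency edge between tokens 0 and 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q = rng.normal(size=(3, 8, 8)) * 0.1
w_k = rng.normal(size=(8, 8)) * 0.1
w_v = rng.normal(size=(3, 8, 8)) * 0.1
dep_adj = np.zeros((5, 5))
dep_adj[0, 3] = dep_adj[3, 0] = 1.0
out, attn = dependency_biased_attention(x, w_q, w_k, w_v, dep_adj, alpha=2.0)
```

With an additive bias, rows that have no dependency edge keep their ordinary attention distribution, while rows with an edge shift probability mass toward the linked token. A multiplicative or masking form of the prior would behave differently.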
Pages: 11