On the localness modeling for the self-attention based end-to-end speech synthesis

Cited by: 0
Authors
Yang, Shan [1 ]
Lu, Heng [2 ]
Kang, Shiyin [2 ]
Xue, Liumeng [1 ]
Xiao, Jinba [1 ]
Su, Dan [2 ]
Xie, Lei [1 ]
Yu, Dong [2 ]
Affiliations
[1] Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[2] Tencent AI Lab, China
Keywords
Gaussian distribution; Recurrent neural networks
DOI: Not available
Abstract
Attention-based end-to-end speech synthesis achieves better prosody and quality than the conventional front-end–back-end structure. However, training such an end-to-end framework is usually time-consuming because of its recurrent neural networks. To enable parallel computation and long-range dependency modeling, Transformer, a framework built solely on self-attention, was recently proposed within the end-to-end family. However, self-attention lacks position information in sequential modeling, so extra position representations are crucial for good performance. Moreover, self-attention computes each latent representation as a weighted sum over the whole input sequence, which may disperse attention across the entire sequence rather than focusing on the more important neighboring input states, resulting in generation errors. In this paper, we introduce two localness modeling methods that enhance the self-attention-based representation for speech synthesis, preserving self-attention's parallel computation and global-range dependency modeling while improving generation stability. We systematically analyze the solely self-attention-based end-to-end speech synthesis framework and unveil the importance of local context. We then add the proposed relative-position-aware method to strengthen local edges, and experiment with different architectures to examine the effectiveness of localness modeling. To obtain a query-specific window and discard the hyper-parameter of the relative-position-aware approach, we further introduce a Gaussian-based bias to enhance localness. Experimental results indicate that both proposed localness-enhanced methods improve the performance of the self-attention model, especially when applied to the encoder. Furthermore, the query-specific window of the Gaussian-bias approach is more robust than fixed relative edges. © 2020 Elsevier Ltd
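For concreteness, the sketch below illustrates in NumPy how the two localness methods described in the abstract can be realised as additive biases on the self-attention logits. This is a minimal reconstruction from the abstract alone, not the authors' code: the hard-window mask is a simplification of the relative-position-aware method (the paper itself uses learned relative-position representations), and the projection parameters W_p and v_p of the Gaussian bias, along with the window size k, are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, bias=None):
    # Scaled dot-product attention; localness enters as an additive bias on the logits.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])        # (T, T) attention energies
    if bias is not None:
        logits = logits + bias
    return softmax(logits) @ V

def relative_window_bias(T, k, penalty=-1e9):
    # Hard-window simplification of the relative-position-aware method:
    # keys farther than k positions from the query are masked out before the softmax.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return np.where(np.abs(i - j) <= k, 0.0, penalty)

def gaussian_locality_bias(Q, T, W_p, v_p):
    # Query-specific Gaussian bias: each query predicts the centre p and width sigma
    # of its own local window; distant positions are penalised before the softmax.
    h = np.tanh(Q @ W_p)                           # (T, d_p) hidden projection
    p = T / (1.0 + np.exp(-h @ v_p[:, 0]))         # (T,) predicted window centres
    sigma = T / (1.0 + np.exp(-h @ v_p[:, 1]))     # (T,) predicted window widths
    j = np.arange(T)[None, :]                      # key positions
    return -((j - p[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2 + 1e-9)

# Toy usage: T=8 frames, d=16 channels.
rng = np.random.default_rng(0)
T, d, d_p = 8, 16, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out_window = self_attention(Q, K, V, bias=relative_window_bias(T, k=2))
out_gauss = self_attention(Q, K, V,
                           bias=gaussian_locality_bias(Q, T,
                                                       rng.standard_normal((d, d_p)),
                                                       rng.standard_normal((d_p, 2))))

Note the trade-off the abstract alludes to: the hard window needs its hyper-parameter k to be tuned, whereas the Gaussian bias lets every query predict its own centre and width, consistent with the reported robustness of the query-specific window.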
Pages: 121 - 130
Related papers (50 items in total)
  • [21] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    APPLIED SCIENCES-BASEL, 2019, 9 (21)
  • [22] Reinforcement-Tracking: An End-to-End Trajectory Tracking Method Based on Self-Attention Mechanism
    Zhao, Guanglei
    Chen, Zihao
    Liao, Weiming
    INTERNATIONAL JOURNAL OF AUTOMOTIVE TECHNOLOGY, 2024, 25 (03) : 541 - 551
  • [24] An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
    Hong, Yong
    Li, Deren
    Luo, Shupei
    Chen, Xin
    Yang, Yi
    Wang, Mi
    REMOTE SENSING, 2022, 14 (24)
  • [25] IMPROVED END-TO-END SPOKEN UTTERANCE CLASSIFICATION WITH A SELF-ATTENTION ACOUSTIC CLASSIFIER
    Price, Ryan
    Mehrabani, Mahnoosh
    Bangalore, Srinivas
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2020 : 8504 - 8508
  • [26] KNOWLEDGE DISTILLATION USING OUTPUT ERRORS FOR SELF-ATTENTION END-TO-END MODELS
    Kim, Ho-Gyeong
    Na, Hwidong
    Lee, Hoshik
    Lee, Jihyun
    Kang, Tae Gyoon
    Lee, Min-Joong
    Choi, Young Sang
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019 : 6181 - 6185
  • [27] Application of an end-to-end model with self-attention mechanism in cardiac disease prediction
    Li, Li
    Chen, Xi
    Hu, Sanjun
    FRONTIERS IN PHYSIOLOGY, 2024, 14
  • [28] An End-to-end Speech Recognition Algorithm based on Attention Mechanism
    Chen, Jia-nan
    Gao, Shuang
    Sun, Han-zhe
    Liu, Xiao-hui
    Wang, Zi-ning
    Zheng, Yan
    PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020 : 2935 - 2940
  • [29] A Novel End-to-End Network Based on a Bidirectional GRU and a Self-Attention Mechanism for Denoising of Electroencephalography Signals
    Wang, Wenlong
    Li, Baojiang
    Wang, Haiyan
    NEUROSCIENCE, 2022, 505 : 10 - 20
  • [30] TRIGGERED ATTENTION FOR END-TO-END SPEECH RECOGNITION
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019 : 5666 - 5670