On the localness modeling for the self-attention based end-to-end speech synthesis

Cited by: 0
Authors
Yang, Shan [1 ]
Lu, Heng [2 ]
Kang, Shiyin [2 ]
Xue, Liumeng [1 ]
Xiao, Jinba [1 ]
Su, Dan [2 ]
Xie, Lei [1 ]
Yu, Dong [2 ]
Affiliations
[1] Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[2] Tencent AI Lab, China
Keywords
Gaussian distribution; Recurrent neural networks
DOI
Not available
Abstract
Attention-based end-to-end speech synthesis achieves better performance in both prosody and quality compared to the conventional front-end–back-end structure. However, training such an end-to-end framework is usually time-consuming because of the use of recurrent neural networks. To enable parallel computation and long-range dependency modeling, a purely self-attention based framework named Transformer has recently been proposed in the end-to-end family. However, it lacks position information in sequential modeling, so extra position representations are crucial to achieve good performance. Besides, the weighted-sum form of self-attention is computed over the whole input sequence when producing latent representations, which may disperse the attention across the entire sequence rather than focusing on the more important neighboring input states, resulting in generation errors. In this paper, we introduce two localness modeling methods to enhance the self-attention based representation for speech synthesis, which maintain the parallel computation and global-range dependency modeling of self-attention while improving generation stability. We systematically analyze the purely self-attention based end-to-end speech synthesis framework and unveil the importance of local context. We then add the proposed relative-position-aware method to enhance local edges and experiment with different architectures to examine the effectiveness of localness modeling. To achieve a query-specific window and discard the hyper-parameter of the relative-position-aware approach, we further introduce a Gaussian-based bias to enhance localness. Experimental results indicate that the two proposed localness-enhanced methods can both improve the performance of the self-attention model, especially when applied to the encoder part, and that the query-specific window of the Gaussian-bias approach is more robust than the fixed relative edges. © 2020 Elsevier Ltd
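To illustrate the Gaussian-bias idea described in the abstract, the sketch below adds a query-centered Gaussian penalty to the attention logits before the softmax, so that attention concentrates on neighboring positions while all positions remain reachable. This is an illustrative NumPy example, not the authors' implementation: the function name, the fixed window width `sigma`, and centering the Gaussian at the query's own position are assumptions made here for simplicity (the paper predicts a query-specific window rather than using a fixed one).

```python
# Minimal sketch of self-attention with an additive Gaussian localness bias.
# Assumptions: single head, fixed window width `sigma`, Gaussian centered at
# the query position. The paper's method instead predicts a query-specific
# window; this sketch only conveys the general mechanism.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_localness_self_attention(x, wq, wk, wv, sigma=3.0):
    """x: (T, d_model); wq/wk/wv: (d_model, d_k) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    t, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)            # (T, T) content-based logits
    # Gaussian bias: key positions far from the query position are penalized,
    # so the softmax mass concentrates on local neighbors.
    pos = np.arange(t)
    dist = pos[None, :] - pos[:, None]         # j - i for query i, key j
    bias = -(dist ** 2) / (2.0 * sigma ** 2)   # log of an unnormalized Gaussian
    weights = softmax(scores + bias, axis=-1)
    return weights @ v

# Usage example: 8 frames of 16-dim features projected to an 8-dim head.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
wq, wk, wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = gaussian_localness_self_attention(x, wq, wk, wv, sigma=2.0)
print(out.shape)  # (8, 8)
```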
Pages: 121 - 130
Related papers
Showing 10 of 50
  • [1] On the localness modeling for the self-attention based end-to-end speech synthesis
    Yang, Shan
    Lu, Heng
    Kang, Shiyin
    Xue, Liumeng
    Xiao, Jinba
    Su, Dan
    Xie, Lei
    Yu, Dong
    NEURAL NETWORKS, 2020, 125 : 121 - 130
  • [2] Efficient decoding self-attention for end-to-end speech synthesis
    Zhao, Wei
    Xu, Li
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07) : 1127 - 1138
  • [3] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399
  • [4] END-TO-END SPEECH SUMMARIZATION USING RESTRICTED SELF-ATTENTION
    Sharma, Roshan
    Palaskar, Shruti
    Black, Alan W.
    Metze, Florian
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8072 - 8076
  • [5] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [6] IMPROVING MANDARIN END-TO-END SPEECH SYNTHESIS BY SELF-ATTENTION AND LEARNABLE GAUSSIAN BIAS
    Yang, Fengyu
    Yang, Shan
    Zhu, Pengcheng
    Yan, Pengju
    Xie, Lei
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 208 - 213
  • [7] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [8] Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention
    Liang, Chengdong
    Xu, Menglong
    Zhang, Xiao-Lei
    INTERSPEECH 2021, 2021, 2 : 1495 - 1499
  • [9] End-to-end Parking Behavior Recognition Based on Self-attention Mechanism
    Li, Penghua
    Zhu, Dechen
    Mou, Qiyun
    Tu, Yushan
    Wu, Jinfeng
    2023 2ND ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING, CACML 2023, 2023, : 371 - 376
  • [10] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599