Neural Speech Synthesis with Transformer Network

Cited: 0
Authors
Li, Naihan [1,3,4]
Liu, Shujie [2]
Liu, Yanqing [3]
Zhao, Sheng [3]
Liu, Ming [1,4]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft STC Asia, Beijing, Peoples R China
[4] CETC Big Data Res Inst Co Ltd, Guiyang, Guizhou, Peoples R China
Source
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2019
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
CLC classification number
TP18 (Theory of artificial intelligence)
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty in modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by self-attention, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of the new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that the proposed model achieves state-of-the-art quality (outperforming Tacotron2 by a gap of 0.048) and comes very close to human quality (4.39 vs. 4.44 in MOS).
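The mechanism the abstract describes, multi-head self-attention constructing all hidden states in parallel while directly connecting any two time steps, can be sketched in NumPy. This is an illustrative sketch of scaled dot-product multi-head self-attention, not the paper's implementation: the function name and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are assumptions for demonstration, and details such as positional encoding and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product multi-head self-attention over one sequence.

    X: (T, d_model) input sequence; Wq, Wk, Wv, Wo: (d_model, d_model)
    projection matrices. Returns a (T, d_model) output sequence.
    """
    T, d_model = X.shape
    d_head = d_model // num_heads

    # Project the whole sequence at once -- no recurrence, so every
    # time step is computed in parallel.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split the model dimension into heads: (num_heads, T, d_head).
    def split(M):
        return M.reshape(T, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Attention scores connect every position to every other position
    # directly, which is what sidesteps the long-range dependency
    # problem of RNNs. Shape: (num_heads, T, T).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)

    # Weighted sum of values, then merge heads back to (T, d_model).
    out = (attn @ Vh).transpose(1, 0, 2).reshape(T, d_model)
    return out @ Wo
```

Because the T×T score matrix is computed with a single matrix product rather than a step-by-step recurrence, training parallelizes over the whole sequence, which is the source of the roughly 4.25x training speedup the abstract reports relative to Tacotron2's RNN encoder and decoder.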
Pages: 6706-6713 (8 pages)
Related papers
50 items
  • [1] IMPROVING END-TO-END SPEECH SYNTHESIS WITH LOCAL RECURRENT NEURAL NETWORK ENHANCED TRANSFORMER
    Zheng, Yibin
    Li, Xinhui
    Xie, Fenglong
    Lu, Li
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6734 - 6738
  • [2] The Transformer Neural Network Architecture for Part-of-Speech Tagging
    Maksutov, Artem A.
    Zamyatovskiy, Vladimir I.
    Morozov, Viacheslav O.
    Dmitriev, Sviatoslav O.
    PROCEEDINGS OF THE 2021 IEEE CONFERENCE OF RUSSIAN YOUNG RESEARCHERS IN ELECTRICAL AND ELECTRONIC ENGINEERING (ELCONRUS), 2021, : 536 - 540
  • [3] PATNET: A PHONEME-LEVEL AUTOREGRESSIVE TRANSFORMER NETWORK FOR SPEECH SYNTHESIS
    Wang, Shiming
    Ling, Zhenhua
    Fu, Ruibo
    Yi, Jiangyan
    Tao, Jianhua
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5684 - 5688
  • [4] Tibetan speech synthesis based on an improved neural network
    Ding, Yuntao
    Cai, Rangzhuoma
    Gong, Baojia
    2020 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE COMMUNICATION AND NETWORK SECURITY (CSCNS2020), 2021, 336
  • [5] Speech recognition system of transformer improved by pre-parallel convolution Neural Network
    Yue, Qi
    Han, Zhan
    Chu, Jing
    Han, Xiaokai
    Li, Peiwen
    Deng, Xuhui
    PROCEEDINGS OF 2022 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION (IEEE ICMA 2022), 2022, : 928 - 933
  • [6] Neural network based autonomous control of a speech synthesis system
    Panagiotopoulos, Dimokritos
    Orovas, Christos
    Syndoukas, Dimitrios
    INTELLIGENT SYSTEMS WITH APPLICATIONS, 2022, 14
  • [7] Style Transplantation in Neural Network-based Speech Synthesis
    Suzic, Sinisa B.
    Delic, Tijana V.
    Pekar, Darko J.
    Delic, Vlado D.
    Secujski, Milan S.
    ACTA POLYTECHNICA HUNGARICA, 2019, 16 (06) : 171 - 189
  • [8] Research on Dungan speech synthesis based on Deep Neural Network
    Chen, Lijia
    Yang, Hongwu
    Wang, Hui
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 46 - 50
  • [9] A Comparison of Expressive Speech Synthesis Approaches based on Neural Network
    Xue, Liumeng
    Zhu, Xiaolian
    An, Xiaochun
    Xie, Lei
    PROCEEDINGS OF THE JOINT WORKSHOP OF THE 4TH WORKSHOP ON AFFECTIVE SOCIAL MULTIMEDIA COMPUTING AND FIRST MULTI-MODAL AFFECTIVE COMPUTING OF LARGE-SCALE MULTIMEDIA DATA (ASMMC-MMAC'18), 2018, : 15 - 20
  • [10] Development of a neural network library for resource constrained speech synthesis
    Menon, Sujeendran
    Zarzycki, Pawel
    Ganzha, Maria
    Paprzycki, Marcin
    2020 5TH IEEE INTERNATIONAL CONFERENCE ON RECENT ADVANCES AND INNOVATIONS IN ENGINEERING (IEEE - ICRAIE-2020), 2020,