Neural Speech Synthesis with Transformer Network

Cited: 0
Authors
Li, Naihan [1,3,4]
Liu, Shujie [2]
Liu, Yanqing [3]
Zhao, Sheng [3]
Liu, Ming [1,4]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft STC Asia, Beijing, Peoples R China
[4] CETC Big Data Res Inst Co Ltd, Guiyang, Guizhou, Peoples R China
Source
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2019
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
CLC classification number
TP18 (Theory of artificial intelligence)
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty in modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by self-attention, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of the new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that the proposed model achieves state-of-the-art quality (outperforming Tacotron2 by a gap of 0.048) and comes very close to human quality (4.39 vs. 4.44 in MOS).
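The mechanism the abstract describes, multi-head self-attention constructing all hidden states in parallel while directly connecting any two time steps, can be sketched in NumPy. This is an illustrative sketch of scaled dot-product multi-head self-attention, not the paper's implementation: the function name and the weight matrices `Wq`, `Wk`, `Wv`, `Wo` are assumptions for demonstration, and details such as positional encoding and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product multi-head self-attention over one sequence.

    X: (T, d_model) input sequence; Wq, Wk, Wv, Wo: (d_model, d_model)
    projection matrices. Returns a (T, d_model) output sequence.
    """
    T, d_model = X.shape
    d_head = d_model // num_heads

    # Project the whole sequence at once -- no recurrence, so every
    # time step is computed in parallel.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split the model dimension into heads: (num_heads, T, d_head).
    def split(M):
        return M.reshape(T, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Attention scores connect every position to every other position
    # directly, which is what sidesteps the long-range dependency
    # problem of RNNs. Shape: (num_heads, T, T).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)

    # Weighted sum of values, then merge heads back to (T, d_model).
    out = (attn @ Vh).transpose(1, 0, 2).reshape(T, d_model)
    return out @ Wo
```

Because the T×T score matrix is computed with a single matrix product rather than a step-by-step recurrence, training parallelizes over the whole sequence, which is the source of the roughly 4.25x training speedup the abstract reports relative to Tacotron2's RNN encoder and decoder.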
Pages: 6706-6713 (8 pages)
Related papers
50 items
  • [1] IMPROVING END-TO-END SPEECH SYNTHESIS WITH LOCAL RECURRENT NEURAL NETWORK ENHANCED TRANSFORMER
    Zheng, Yibin
    Li, Xinhui
    Xie, Fenglong
    Lu, Li
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6734 - 6738
  • [2] The Transformer Neural Network Architecture for Part-of-Speech Tagging
    Maksutov, Artem A.
    Zamyatovskiy, Vladimir I.
    Morozov, Viacheslav O.
    Dmitriev, Sviatoslav O.
    PROCEEDINGS OF THE 2021 IEEE CONFERENCE OF RUSSIAN YOUNG RESEARCHERS IN ELECTRICAL AND ELECTRONIC ENGINEERING (ELCONRUS), 2021, : 536 - 540
  • [3] PATNET: A PHONEME-LEVEL AUTOREGRESSIVE TRANSFORMER NETWORK FOR SPEECH SYNTHESIS
    Wang, Shiming
    Ling, Zhenhua
    Fu, Ruibo
    Yi, Jiangyan
    Tao, Jianhua
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5684 - 5688
  • [4] Tibetan speech synthesis based on an improved neural network
    Ding, Yuntao
    Cai, Rangzhuoma
    Gong, Baojia
    2020 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE COMMUNICATION AND NETWORK SECURITY (CSCNS2020), 2021, 336
  • [5] Speech recognition system of transformer improved by pre-parallel convolution Neural Network
    Yue, Qi
    Han, Zhan
    Chu, Jing
    Han, Xiaokai
    Li, Peiwen
    Deng, Xuhui
    PROCEEDINGS OF 2022 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION (IEEE ICMA 2022), 2022, : 928 - 933
  • [6] Neural network based autonomous control of a speech synthesis system
    Panagiotopoulos, Dimokritos
    Orovas, Christos
    Syndoukas, Dimitrios
    INTELLIGENT SYSTEMS WITH APPLICATIONS, 2022, 14
  • [7] Style Transplantation in Neural Network-based Speech Synthesis
    Suzic, Sinisa B.
    Delic, Tijana V.
    Pekar, Darko J.
    Delic, Vlado D.
    Secujski, Milan S.
    ACTA POLYTECHNICA HUNGARICA, 2019, 16 (06) : 171 - 189
  • [8] Research on Dungan speech synthesis based on Deep Neural Network
    Chen, Lijia
    Yang, Hongwu
    Wang, Hui
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 46 - 50
  • [9] A Comparison of Expressive Speech Synthesis Approaches based on Neural Network
    Xue, Liumeng
    Zhu, Xiaolian
    An, Xiaochun
    Xie, Lei
    PROCEEDINGS OF THE JOINT WORKSHOP OF THE 4TH WORKSHOP ON AFFECTIVE SOCIAL MULTIMEDIA COMPUTING AND FIRST MULTI-MODAL AFFECTIVE COMPUTING OF LARGE-SCALE MULTIMEDIA DATA (ASMMC-MMAC'18), 2018, : 15 - 20
  • [10] Development of a neural network library for resource constrained speech synthesis
    Menon, Sujeendran
    Zarzycki, Pawel
    Ganzha, Maria
    Paprzycki, Marcin
    2020 5TH IEEE INTERNATIONAL CONFERENCE ON RECENT ADVANCES AND INNOVATIONS IN ENGINEERING (IEEE - ICRAIE-2020), 2020,