NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0
Authors
Wang, Xin [1]
Takaki, Shinji [1]
Yamagishi, Junichi [1]
Affiliations
[1] Natl Inst Informat, Tokyo, Japan
Keywords
speech synthesis; neural network; waveform modeling
DOI
10.1109/icassp.2019.8682298
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated because they rely on a distilling training method and a blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and that the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
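The two ideas the abstract names, a sine-based excitation driven by F0 and a spectrum-based training criterion, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the amplitude and noise levels, the frame settings, and the F0 tracks are all hypothetical choices made for illustration, and the neural filter module is omitted entirely. The sketch assumes a per-sample F0 track in Hz where 0 marks unvoiced samples.

```python
import cmath
import math
import random


def sine_excitation(f0, sample_rate=16000, amplitude=0.1, noise_std=0.003, seed=0):
    """Sine-based excitation from a per-sample F0 track (Hz); F0 == 0 means unvoiced.

    The amplitude and noise levels are illustrative placeholders.
    """
    rng = random.Random(seed)
    phase = 0.0
    signal = []
    for hz in f0:
        if hz > 0.0:
            # Voiced: accumulate the instantaneous phase so the sine follows F0.
            phase = (phase + 2.0 * math.pi * hz / sample_rate) % (2.0 * math.pi)
            signal.append(amplitude * math.sin(phase) + rng.gauss(0.0, noise_std))
        else:
            # Unvoiced: Gaussian noise only.
            signal.append(rng.gauss(0.0, amplitude / 3.0))
    return signal


def log_power_spectrum(frame, eps=1e-5):
    """Log power spectrum of one Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(windowed))
        spectrum.append(math.log(abs(coeff) ** 2 + eps))
    return spectrum


def spectral_distance(generated, target, frame_len=64, hop=32):
    """Mean squared difference of framewise log power spectra.

    One possible spectrum-based criterion, assumed here for illustration.
    """
    total, count = 0.0, 0
    usable = min(len(generated), len(target))
    for start in range(0, usable - frame_len + 1, hop):
        a = log_power_spectrum(generated[start:start + frame_len])
        b = log_power_spectrum(target[start:start + frame_len])
        total += sum((p - q) ** 2 for p, q in zip(a, b))
        count += len(a)
    return total / count if count else 0.0


# Hypothetical F0 tracks: 10 ms unvoiced, then 20 ms voiced, at 16 kHz.
excitation = sine_excitation([0.0] * 160 + [120.0] * 320)
reference = sine_excitation([0.0] * 160 + [200.0] * 320, seed=1)
print(len(excitation))
print(spectral_distance(excitation, excitation))
print(spectral_distance(excitation, reference) > 0.0)
```

In a trainable version, the distance would be computed between the filter module's output and natural speech and minimized by stochastic gradient descent. The naive DFT here only keeps the sketch dependency-free; a practical system would use an FFT in an automatic-differentiation framework.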
Pages: 5916-5920
Page count: 5