NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0

Authors:

Wang, Xin [1]

Takaki, Shinji [1]

Yamagishi, Junichi [1]

Affiliations:

[1] National Institute of Informatics, Tokyo, Japan

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019

Keywords:

speech synthesis; neural network; waveform modeling

DOI:

10.1109/icassp.2019.8682298

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated because they rely on a distilling training method and a blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be directly trained using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and that the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
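
The two ingredients named in the abstract, a sine-based excitation signal and a spectrum-based training criterion, can be illustrated with a rough NumPy sketch. This is not the authors' implementation. It assumes a per-sample F0 contour in Hz with 0 marking unvoiced samples; the function names, the 16 kHz sample rate, the amplitude and noise levels, and the STFT settings are all hypothetical choices made for illustration. The neural filter module that maps the excitation to speech is not shown.

```python
import numpy as np


def sine_excitation(f0, sample_rate=16000, amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sine-based excitation from a per-sample F0 contour (illustrative).

    f0: 1-D array of F0 values in Hz, one per waveform sample; 0 means unvoiced.
    All default parameter values are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    # Instantaneous phase is the running sum of 2*pi*f0/fs, so the sine
    # follows the F0 contour without phase jumps between samples.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    noise = rng.normal(0.0, noise_std, size=f0.shape)
    sine = amplitude * np.sin(phase)
    # Voiced samples: sine plus a little noise. Unvoiced samples: noise only.
    return np.where(voiced, sine + noise, noise)


def log_spectral_distance(x, y, frame_len=512, hop=128, eps=1e-5):
    """Mean squared difference of log power spectra (one possible spectral loss)."""

    def stft_power(sig):
        sig = np.asarray(sig, dtype=np.float64)
        n_frames = 1 + (len(sig) - frame_len) // hop
        window = np.hanning(frame_len)
        frames = np.stack(
            [sig[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
        )
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    px, py = stft_power(x), stft_power(y)
    return 0.5 * np.mean((np.log(px + eps) - np.log(py + eps)) ** 2)


# Usage: 0.25 s unvoiced followed by 0.75 s voiced at 100 Hz.
f0 = np.full(16000, 100.0)
f0[:4000] = 0.0
excitation = sine_excitation(f0)
```

In a trainable version, a distance of this kind would be computed between the generated and natural waveforms and minimized with stochastic gradient descent, which requires an automatic-differentiation framework rather than plain NumPy.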

Pages: 5916-5920

Number of pages: 5
