NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0
Authors
Wang, Xin [1]
Takaki, Shinji [1]
Yamagishi, Junichi [1]
Affiliations
[1] Natl Inst Informat, Tokyo, Japan
Keywords
speech synthesis; neural network; waveform modeling;
DOI
10.1109/icassp.2019.8682298
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models have recently been reported, they may be prohibitively complicated because they rely on a distillation-based training method and a blend of disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet, and that the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
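To make the source-filter pipeline in the abstract concrete, the following is a minimal sketch written with PyTorch. The module names (SineSource, ConvFilter, spectral_loss) and all hyper-parameters are illustrative assumptions, not the authors' released implementation; it only shows the overall idea of a sine-based excitation, a convolutional filter, and a spectral-amplitude training criterion that allows plain gradient-descent training without an AR structure.

```python
# Minimal sketch of a non-AR source-filter waveform model (assumed names/values).
import torch
import torch.nn as nn

class SineSource(nn.Module):
    """Source module: turn a frame-rate F0 contour into a sample-rate
    sine-based excitation signal (noise fills unvoiced regions)."""
    def __init__(self, sample_rate=16000, noise_std=0.003):
        super().__init__()
        self.sample_rate = sample_rate
        self.noise_std = noise_std

    def forward(self, f0, hop_length=80):
        # f0: (batch, frames); upsample to the waveform rate by repetition
        f0_up = f0.repeat_interleave(hop_length, dim=1)           # (batch, samples)
        voiced = (f0_up > 0).float()
        # integrate the instantaneous frequency to obtain the sine phase
        phase = 2 * torch.pi * torch.cumsum(f0_up / self.sample_rate, dim=1)
        sine = torch.sin(phase) * voiced
        noise = torch.randn_like(sine) * self.noise_std
        return sine + noise                                        # excitation

class ConvFilter(nn.Module):
    """Filter module: stacked dilated 1-D convolutions that transform the
    excitation into the output speech waveform in one parallel pass."""
    def __init__(self, channels=64, layers=6):
        super().__init__()
        blocks = []
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.Conv1d(1 if i == 0 else channels, channels,
                                 kernel_size=3, dilation=d, padding=d),
                       nn.Tanh()]
        blocks.append(nn.Conv1d(channels, 1, kernel_size=1))
        self.net = nn.Sequential(*blocks)

    def forward(self, excitation):
        return self.net(excitation.unsqueeze(1)).squeeze(1)        # (batch, samples)

def spectral_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """Spectrum-based training criterion: distance between log spectral
    amplitudes at several STFT resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs().clamp(min=1e-7)
        T = torch.stft(target, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs().clamp(min=1e-7)
        loss = loss + (P.log() - T.log()).pow(2).mean()
    return loss / len(fft_sizes)

# Usage: waveform generation is a single non-autoregressive forward pass.
source, filt = SineSource(), ConvFilter()
f0 = torch.full((1, 100), 220.0)            # toy frame-rate F0 contour (Hz)
target = torch.randn(1, 100 * 80)           # toy reference waveform
waveform = filt(source(f0))
loss = spectral_loss(waveform, target)      # optimise with ordinary SGD/Adam
loss.backward()
```

Because every sample is produced in one forward pass rather than one at a time, a model of this shape is what permits the large generation speed-up over the AR WaveNet reported in the abstract.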
Pages: 5916-5920
Number of pages: 5
Related Papers
50 records in total
  • [1] Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
    Wang, Xin
    Yamagishi, Junichi
    [J]. INTERSPEECH 2020, 2020, : 1992 - 1996
  • [2] Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 402 - 415
  • [3] Reverberation Modeling for Source-Filter-based Neural Vocoder
    Ai, Yang
    Wang, Xin
    Yamagishi, Junichi
    Ling, Zhen-Hua
    [J]. INTERSPEECH 2020, 2020, : 3560 - 3564
  • [4] Waveform generation based on signal reshaping for statistical parametric speech synthesis
    Espic, Felipe
    Valentini-Botinhao, Cassia
    Wu, Zhizheng
    King, Simon
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2263 - 2267
  • [5] Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
    Bollepalli, Bajibabu
    Juvela, Lauri
    Alku, Paavo
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3394 - 3398
  • [6] SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis
Rao, Achuth M. V.
    Ghosh, Prasanta Kumar
[J]. IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1170 - 1174
  • [7] GlotNet-A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
    Juvela, Lauri
    Bollepalli, Bajibabu
    Tsiaras, Vassilis
    Alku, Paavo
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (06) : 1019 - 1030
  • [8] SAMPLERNN-BASED NEURAL VOCODER FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Ai, Yang
    Wu, Hong-Chuan
    Ling, Zhen-Hua
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5659 - 5663
  • [9] Source-Filter-Based Single-Channel Speech Separation Using Pitch Information
    Stark, Michael
    Wohlmayr, Michael
    Pernkopf, Franz
[J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (02) : 242 - 255
  • [10] VOICE SOURCE MODELLING USING DEEP NEURAL NETWORKS FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Raitio, Tuomo
    Lu, Heng
    Kane, John
    Suni, Antti
    Vainio, Martti
    King, Simon
    Alku, Paavo
    [J]. 2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 2290 - 2294