NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Cited by: 0
Authors
Wang, Xin [1]
Takaki, Shinji [1]
Yamagishi, Junichi [1]
Affiliations
[1] Natl Inst Informat, Tokyo, Japan
Keywords
speech synthesis; neural network; waveform modeling;
DOI
10.1109/icassp.2019.8682298
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Neural waveform models such as WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models have recently been reported, they may be prohibitively complicated because they rely on a distillation-based training method and a blend of disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet, and that the quality of its synthetic speech was close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
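To make the source-filter pipeline in the abstract concrete, the following is a minimal sketch written with PyTorch. The module names (SineSource, ConvFilter, spectral_loss) and all hyper-parameters are illustrative assumptions, not the authors' released implementation; it only shows the overall idea of a sine-based excitation, a convolutional filter, and a spectral-amplitude training criterion that allows plain gradient-descent training without an AR structure.

```python
# Minimal sketch of a non-AR source-filter waveform model (assumed names/values).
import torch
import torch.nn as nn

class SineSource(nn.Module):
    """Source module: turn a frame-rate F0 contour into a sample-rate
    sine-based excitation signal (noise fills unvoiced regions)."""
    def __init__(self, sample_rate=16000, noise_std=0.003):
        super().__init__()
        self.sample_rate = sample_rate
        self.noise_std = noise_std

    def forward(self, f0, hop_length=80):
        # f0: (batch, frames); upsample to the waveform rate by repetition
        f0_up = f0.repeat_interleave(hop_length, dim=1)           # (batch, samples)
        voiced = (f0_up > 0).float()
        # integrate the instantaneous frequency to obtain the sine phase
        phase = 2 * torch.pi * torch.cumsum(f0_up / self.sample_rate, dim=1)
        sine = torch.sin(phase) * voiced
        noise = torch.randn_like(sine) * self.noise_std
        return sine + noise                                        # excitation

class ConvFilter(nn.Module):
    """Filter module: stacked dilated 1-D convolutions that transform the
    excitation into the output speech waveform in one parallel pass."""
    def __init__(self, channels=64, layers=6):
        super().__init__()
        blocks = []
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.Conv1d(1 if i == 0 else channels, channels,
                                 kernel_size=3, dilation=d, padding=d),
                       nn.Tanh()]
        blocks.append(nn.Conv1d(channels, 1, kernel_size=1))
        self.net = nn.Sequential(*blocks)

    def forward(self, excitation):
        return self.net(excitation.unsqueeze(1)).squeeze(1)        # (batch, samples)

def spectral_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """Spectrum-based training criterion: distance between log spectral
    amplitudes at several STFT resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs().clamp(min=1e-7)
        T = torch.stft(target, n_fft, hop_length=n_fft // 4, window=window,
                       return_complex=True).abs().clamp(min=1e-7)
        loss = loss + (P.log() - T.log()).pow(2).mean()
    return loss / len(fft_sizes)

# Usage: waveform generation is a single non-autoregressive forward pass.
source, filt = SineSource(), ConvFilter()
f0 = torch.full((1, 100), 220.0)            # toy frame-rate F0 contour (Hz)
target = torch.randn(1, 100 * 80)           # toy reference waveform
waveform = filt(source(f0))
loss = spectral_loss(waveform, target)      # optimise with ordinary SGD/Adam
loss.backward()
```

Because every sample is produced in one forward pass rather than one at a time, a model of this shape is what permits the large generation speed-up over the AR WaveNet reported in the abstract.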
Pages: 5916-5920
Number of pages: 5
Related Papers
50 records in total
  • [1] Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
    Wang, Xin
    Yamagishi, Junichi
    [J]. INTERSPEECH 2020, 2020, : 1992 - 1996
  • [2] Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 402 - 415
  • [3] Reverberation Modeling for Source-Filter-based Neural Vocoder
    Ai, Yang
    Wang, Xin
    Yamagishi, Junichi
    Ling, Zhen-Hua
    [J]. INTERSPEECH 2020, 2020, : 3560 - 3564
  • [4] Waveform generation based on signal reshaping for statistical parametric speech synthesis
    Espic, Felipe
    Valentini-Botinhao, Cassia
    Wu, Zhizheng
    King, Simon
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2263 - 2267
  • [5] Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
    Bollepalli, Bajibabu
    Juvela, Lauri
    Alku, Paavo
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3394 - 3398
  • [6] SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis
Rao, Achuth M. V.
    Ghosh, Prasanta Kumar
[J]. IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 1170 - 1174
  • [7] GlotNet-A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
    Juvela, Lauri
    Bollepalli, Bajibabu
    Tsiaras, Vassilis
    Alku, Paavo
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (06) : 1019 - 1030
  • [8] SAMPLERNN-BASED NEURAL VOCODER FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Ai, Yang
    Wu, Hong-Chuan
    Ling, Zhen-Hua
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5659 - 5663
  • [9] Source-Filter-Based Single-Channel Speech Separation Using Pitch Information
    Stark, Michael
    Wohlmayr, Michael
    Pernkopf, Franz
[J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2011, 19 (02) : 242 - 255
  • [10] VOICE SOURCE MODELLING USING DEEP NEURAL NETWORKS FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Raitio, Tuomo
    Lu, Heng
    Kane, John
    Suni, Antti
    Vainio, Martti
    King, Simon
    Alku, Paavo
    [J]. 2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 2290 - 2294