Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis

Cited by: 52
Authors
Wang, Xin [1 ]
Takaki, Shinji [2 ]
Yamagishi, Junichi [1 ,3 ]
Affiliations
[1] Natl Inst Informat, Tokyo 1018340, Japan
[2] Nagoya Inst Technol, Nagoya, Aichi 4668555, Japan
[3] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland
Keywords
Training; Mathematical model; Acoustics; Computational modeling; Speech synthesis; Neural networks; neural network; waveform model; short-time Fourier transform; identification
DOI
10.1109/TASLP.2019.2956145
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Neural waveform models have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. One of the best models, called WaveNet, uses an autoregressive (AR) approach to model the distribution of waveform sampling points, but it has to generate a waveform in a time-consuming sequential manner. Some new models that use inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner but require either a larger amount of training time or a complicated model architecture plus a blend of training criteria. As an alternative to AR and IAF-based frameworks, we propose a neural source-filter (NSF) waveform modeling framework that is straightforward to train and fast to generate waveforms. This framework requires three components to generate waveforms: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the input acoustic features for the source and filter modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented using short-time Fourier transform routines. As an initial NSF study, we designed three NSF models under the proposed framework and compared them with WaveNet using our deep learning toolkit. It was demonstrated that the NSF models generated waveforms at least 100 times faster than our WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
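The abstract's key training idea is minimizing spectral-amplitude distances, which can be implemented efficiently with standard STFT routines. Below is a minimal NumPy sketch of such a frame-level log-spectral-amplitude distance between a generated and a target waveform; the function names, frame length, and hop size are illustrative assumptions, not the paper's exact configuration (the NSF models actually combine several STFT resolutions and are trained in a deep learning toolkit).

```python
import numpy as np

def stft_amplitude(x, frame_len=512, hop=128):
    """Frame the signal with a Hann window and return per-frame |STFT|."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def log_spectral_distance(generated, target, eps=1e-8):
    """Mean squared error between log spectral amplitudes of two waveforms."""
    a = np.log(stft_amplitude(generated) + eps)
    b = np.log(stft_amplitude(target) + eps)
    return float(np.mean((a - b) ** 2))

# Toy signals: a 220 Hz sine at 16 kHz (stand-in for the sine-based
# excitation the source module produces) and a noisy copy of it.
t = np.arange(4000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(len(t))
```

Because this loss is computed on STFT magnitudes rather than sample-by-sample, the filter module can be trained without autoregressive feedback, which is what allows one-shot waveform generation at inference time.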
Pages: 402-415
Page count: 14
Related Papers
50 records in total
  • [1] NEURAL SOURCE-FILTER-BASED WAVEFORM MODEL FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5916 - 5920
  • [2] Nonlinear interactive source-filter models for speech
    Koc, Turgay
    Ciloglu, Tolga
    [J]. COMPUTER SPEECH AND LANGUAGE, 2016, 36 : 365 - 394
  • [3] FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
    Bak, Taejun
    Bae, Jae-Sung
    Bae, Hanbin
    Kim, Young-Ik
    Cho, Hoon-Young
    [J]. INTERSPEECH 2021, 2021, : 116 - 120
  • [4] Autoregressive Models for Statistical Parametric Speech Synthesis
    Shannon, Matt
    Zen, Heiga
    Byrne, William
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (03): : 587 - 597
  • [5] Quantifying Parameters of a Source-Filter Model for Oesophageal Speech
    O'Toole, John M.
    Garcia Zapirain, Begona
    [J]. 2011 IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT), 2011, : 532 - 537
  • [6] Source-filter Separation of Speech Signal in the Phase Domain
    Loweimi, Erfan
    Barker, Jon
    Hain, Thomas
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 598 - 602
  • [7] Source-filter models for time-scale pitch-scale modification of speech
    Acero, A
    [J]. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 881 - 884
  • [8] VOICE SOURCE MODELLING USING DEEP NEURAL NETWORKS FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Raitio, Tuomo
    Lu, Heng
    Kane, John
    Suni, Antti
    Vainio, Martti
    King, Simon
    Alku, Paavo
    [J]. 2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 2290 - 2294
  • [9] Waveform generation based on signal reshaping for statistical parametric speech synthesis
    Espic, Felipe
    Valentini-Botinhao, Cassia
    Wu, Zhizheng
    King, Simon
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2263 - 2267
  • [10] THE EFFECT OF NEURAL NETWORKS IN STATISTICAL PARAMETRIC SPEECH SYNTHESIS
    Hashimoto, Kei
    Oura, Keiichiro
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4455 - 4459