Improving Trajectory Modelling for DNN-Based Speech Synthesis by Using Stacked Bottleneck Features and Minimum Generation Error Training

Cited by: 24
Authors
Wu, Zhizheng [1 ]
King, Simon [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland
Funding
UK Engineering and Physical Sciences Research Council;
Keywords
Acoustic modelling; bottleneck; deep neural network; minimum generation error; speech synthesis; DEEP NEURAL-NETWORKS; HMM;
DOI
10.1109/TASLP.2016.2551865
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
We propose two novel techniques, stacked bottleneck features and a minimum generation error (MGE) training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. These techniques address two related weaknesses of current typical DNN-based synthesis frameworks: frame-by-frame independence, and neglect of the relationship between static and dynamic features. Stacked bottleneck features, an acoustically informed linguistic representation, provide an efficient way to include more detailed linguistic context at the input. The MGE training criterion minimises the overall output trajectory error across an utterance, rather than minimising the error of each frame independently, and thus takes the interaction between static and dynamic features into account. The two techniques can easily be combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques produces significantly more natural synthetic speech than either conventional DNN or long short-term memory recurrent neural network systems.
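The following is a minimal numpy sketch, not taken from the paper, of the two ideas described in the abstract: stacking bottleneck features from neighbouring frames onto the linguistic input, and an MGE-style trajectory loss in which per-frame static-plus-delta predictions are mapped to a smooth static trajectory by the maximum-likelihood parameter generation (MLPG) relation y = (W^T W)^{-1} W^T c_hat before being compared with the target. The function names, the delta windows, the unit-variance simplification and the +/-4-frame context are illustrative assumptions, not details from the paper.

import numpy as np

def stack_bottleneck_context(bottleneck, linguistic, context=4):
    """Append bottleneck features of +/- `context` neighbouring frames
    (an illustrative choice) to the per-frame linguistic features."""
    T, _ = bottleneck.shape
    padded = np.pad(bottleneck, ((context, context), (0, 0)), mode="edge")
    neighbours = np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
    return np.hstack([linguistic, neighbours])

def build_window_matrix(num_frames):
    """Stack static, delta and delta-delta window matrices into W (3T x T)."""
    static = np.eye(num_frames)
    delta = np.zeros((num_frames, num_frames))
    accel = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        lo, hi = max(t - 1, 0), min(t + 1, num_frames - 1)
        delta[t, lo] += -0.5     # common delta window (-0.5, 0, 0.5)
        delta[t, hi] += 0.5
        accel[t, lo] += 1.0      # common delta-delta window (1, -2, 1)
        accel[t, t] += -2.0
        accel[t, hi] += 1.0
    return np.vstack([static, delta, accel])

def trajectory_error(pred_static_delta, target_static):
    """MGE-style loss: generate the static trajectory from the predicted
    static+dynamic features, then take the squared error to the target."""
    W = build_window_matrix(target_static.shape[0])         # (3T, T)
    y = np.linalg.solve(W.T @ W, W.T @ pred_static_delta)   # MLPG, unit variances
    return np.mean((y - target_static) ** 2)

# Toy usage: one utterance of 100 frames, one acoustic dimension.
rng = np.random.default_rng(0)
target = np.sin(np.linspace(0.0, 6.0, 100))
pred = build_window_matrix(100) @ target + 0.01 * rng.standard_normal(300)
print(trajectory_error(pred, target))
bn, ling = rng.standard_normal((100, 32)), rng.standard_normal((100, 300))
print(stack_bottleneck_context(bn, ling).shape)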
Pages: 1255-1265
Number of pages: 11
Related papers
49 records in total
  • [1] Towards minimum perceptual error training for DNN-based speech synthesis
    Valentini-Botinhao, Cassia
    Wu, Zhizheng
    King, Simon
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 869 - 873
  • [2] Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features
    Wu, Zhizheng
    King, Simon
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 309 - 313
  • [3] DNN-Based Speech Synthesis: Importance of Input Features and Training Data
    Lazaridis, Alexandros
    Potard, Blaise
    Garner, Philip N.
    [J]. SPEECH AND COMPUTER (SPECOM 2015), 2015, 9319 : 193 - 200
  • [4] DNN-Based Speech Synthesis for Arabic: Modelling and Evaluation
    Houidhek, Amal
    Colotte, Vincent
    Mnasri, Zied
    Jouvet, Denis
    [J]. STATISTICAL LANGUAGE AND SPEECH PROCESSING, SLSP 2018, 2018, 11171 : 9 - 20
  • [5] On the Training of DNN-based Average Voice Model for Speech Synthesis
    Yang, Shan
    Wu, Zhizheng
    Xie, Lei
    [J]. 2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [6] DNN-Based Speech Synthesis Using Speaker Codes
    Hojo, Nobukatsu
    Ijima, Yusuke
    Mizuno, Hideyuki
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2018, E101D (02): : 462 - 472
  • [7] DNN-Based Cross-Lingual Voice Conversion Using Bottleneck Features
    Reddy, M. Kiran
    Rao, K. Sreenivasa
    [J]. NEURAL PROCESSING LETTERS, 2020, 51 (02) : 2029 - 2042
  • [8] INCORPORATING DYNAMIC FEATURES INTO MINIMUM GENERATION ERROR TRAINING FOR HMM-BASED SPEECH SYNTHESIS
    Ninh, Duy Khanh
    Morise, Masanori
    Yamashita, Yoichi
    [J]. 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 55 - 59
  • [9] Modulation spectrum-based speech parameter trajectory smoothing for DNN-based speech synthesis using FFT spectra
    Takamichi, Shinnosuke
    [J]. 2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1308 - 1311