Improving Trajectory Modelling for DNN-Based Speech Synthesis by Using Stacked Bottleneck Features and Minimum Generation Error Training

Cited by: 24
Authors
Wu, Zhizheng [1]
King, Simon [1]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Acoustic modelling; bottleneck; deep neural network; minimum generation error; speech synthesis; DEEP NEURAL-NETWORKS; HMM;
DOI
10.1109/TASLP.2016.2551865
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose two novel techniques, stacking bottleneck features and a minimum generation error (MGE) training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related issues within current typical DNN-based synthesis frameworks: frame-by-frame independence, and ignorance of the relationship between static and dynamic features. Stacking bottleneck features, which are an acoustically informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The MGE training criterion minimises overall output trajectory error across an utterance, rather than minimising the error per frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than that produced by conventional DNN or long short-term memory recurrent neural network systems.
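The difference between a per-frame error and a trajectory-level (generation) error can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a one-dimensional feature stream with static and delta components only, a delta window of (-0.5, 0, 0.5) with clamping at the utterance edges, and the standard maximum-likelihood parameter generation step, where the trajectory is obtained as c = (W^T P W)^{-1} W^T P mu. All function names are hypothetical.

```python
import numpy as np


def delta_window_matrix(num_frames, delta_win=(-0.5, 0.0, 0.5)):
    """Build W so that W @ c stacks [static, delta] per frame (assumed layout).

    Row 2t holds the static feature of frame t, row 2t+1 its delta.
    """
    T = num_frames
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static row: copies c[t]
        for k, w in enumerate(delta_win):
            tau = min(max(t + k - 1, 0), T - 1)  # clamp at utterance edges
            W[2 * t + 1, tau] += w
    return W


def generate_trajectory(mu, var, W):
    """Parameter generation: solve (W^T P W) c = W^T P mu, with P = diag(1/var)."""
    precision = 1.0 / var
    WtP = W.T * precision  # equivalent to W.T @ diag(precision)
    return np.linalg.solve(WtP @ W, WtP @ mu)


def frame_error(mu, o_ref):
    """Per-frame criterion: squared error on static and delta outputs independently."""
    return float(np.sum((mu - o_ref) ** 2))


def generation_error(mu, var, W, c_ref):
    """Trajectory criterion: squared error between generated and reference statics."""
    c_hat = generate_trajectory(mu, var, W)
    return float(np.sum((c_hat - c_ref) ** 2))
```

In this sketch `mu` stands for the network's predicted static and delta means over a whole utterance, and `var` for the corresponding variances. Because `generation_error` passes the predictions through the window matrix `W` before comparing with the reference, an error in the delta outputs changes the generated static trajectory, which is the interaction the per-frame criterion does not see.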
Pages: 1255-1265
Page count: 11