DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope

Cited by: 2
Authors
Koguchi, Junya [1 ]
Takamichi, Shinnosuke [2 ]
Morise, Masanori [1 ]
Saruwatari, Hiroshi [2 ]
Sagayama, Shigeki [3 ]
Affiliations
[1] Meiji Univ, Tokyo 1648525, Japan
[2] Univ Tokyo, Tokyo 1138656, Japan
[3] Univ Electrocommun, Chofu, Tokyo 1828585, Japan
Source
IEICE Transactions on Information and Systems
Keywords
Gaussian mixture model; spectral envelope; vocoder; deep neural network; text-to-speech synthesis
DOI
10.1587/transinf.2020EDP7075
Chinese Library Classification
TP [automation technology; computer technology]
Discipline code
0812
Abstract
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using a Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis: each Gaussian function of a GMM fits a local resonance of the spectrum, so the GMM retains the fine spectral envelope and achieves high controllability over its structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech when applied to full-band speech. Moreover, a DNN-based TTS synthesis method using the GMM-based approximation has not been formulated, despite the excellent expressive ability of DNNs. We therefore employ peak-picking-based initialization for full-band speech analysis to provide better starting values for the iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN, and we propose a multi-task learning method that minimizes both errors simultaneously. We also propose a post-filter based on variance scaling of the GMM to enhance the synthetic speech. Experimental evaluations of our framework indicated that 1) our initialization method outperforms the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improves the synthetic speech; and 3) our variance-scaling-based post-filter further improves the synthetic speech.
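To make the analysis step concrete, the following is a minimal sketch (not the authors' implementation) of fitting a single-frame, nonnegative full-band spectral envelope with a one-dimensional GMM over frequency: the Gaussian means are initialized on the envelope's highest spectral peaks, then refined iteratively, here by weighted EM on the normalized envelope. The function names (peak_picking_init, fit_gmm_to_envelope, reconstruct_envelope), the choice of weighted EM as the iterative estimator, and the WORLD/CheapTrick envelope mentioned in the usage comment are illustrative assumptions.

```python
# Minimal illustrative sketch (assumed, not the paper's implementation):
# approximate a full-band spectral envelope with a 1-D GMM over frequency,
# initialized by peak picking and refined by weighted EM.
import numpy as np
from scipy.signal import find_peaks


def peak_picking_init(env, freqs, n_mix, init_bw_hz=300.0):
    """Place Gaussian means on the n_mix highest local maxima of the envelope.

    env   : nonnegative (linear-amplitude or power) spectral envelope, 1-D
    freqs : increasing frequency axis in Hz, same length as env
    """
    peaks, _ = find_peaks(env)
    if len(peaks) >= n_mix:
        top = peaks[np.argsort(env[peaks])[-n_mix:]]   # n_mix tallest peaks
        means = np.sort(freqs[top])
    else:  # fallback: spread means uniformly if too few peaks are found
        means = np.linspace(freqs[0], freqs[-1], n_mix)
    heights = np.interp(means, freqs, env)
    weights = heights / heights.sum()
    variances = np.full(n_mix, init_bw_hz ** 2)        # common initial bandwidth
    return weights, means, variances


def fit_gmm_to_envelope(env, freqs, n_mix=30, n_iter=50):
    """Weighted EM: treat the normalized envelope as a density over frequency."""
    w_bin = env / env.sum()                            # per-bin envelope mass
    weights, means, variances = peak_picking_init(env, freqs, n_mix)
    for _ in range(n_iter):
        # E-step: responsibility of each mixture component for each bin
        comp = weights * np.exp(-0.5 * (freqs[:, None] - means) ** 2 / variances) \
               / np.sqrt(2.0 * np.pi * variances)      # shape (n_bins, n_mix)
        resp = comp / np.maximum(comp.sum(axis=1, keepdims=True), 1e-12)
        # M-step: parameter updates weighted by the envelope mass in each bin
        nm = (w_bin[:, None] * resp).sum(axis=0)
        weights = nm / nm.sum()
        means = (w_bin[:, None] * resp * freqs[:, None]).sum(axis=0) / np.maximum(nm, 1e-12)
        variances = (w_bin[:, None] * resp * (freqs[:, None] - means) ** 2).sum(axis=0) \
                    / np.maximum(nm, 1e-12)
        variances = np.maximum(variances, 1.0)         # keep components from collapsing
    return weights, means, variances


def reconstruct_envelope(weights, means, variances, freqs, scale):
    """Rebuild the envelope from GMM parameters (basis of a reconstruction error)."""
    comp = weights * np.exp(-0.5 * (freqs[:, None] - means) ** 2 / variances) \
           / np.sqrt(2.0 * np.pi * variances)
    return scale * comp.sum(axis=1)


# Example usage for one frame: `env` could be a WORLD/CheapTrick spectral
# envelope sampled on a uniform frequency axis `freqs` (hypothetical inputs).
# w, mu, var = fit_gmm_to_envelope(env, freqs, n_mix=30)
# env_hat = reconstruct_envelope(w, mu, var, freqs, env.sum() * (freqs[1] - freqs[0]))
```

A squared error between env_hat and env computed this way is one plausible form of the spectral-envelope reconstruction error that the abstract pairs with the GMM parameter prediction error during DNN training.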
Pages: 2673-2681 (9 pages)
Related papers (50 in total)
  • [1] Real-time, full-band, online DNN-based voice conversion system using a single CPU. Saeki, Takaaki; Saito, Yuki; Takamichi, Shinnosuke; Saruwatari, Hiroshi. Proc. INTERSPEECH 2020, pp. 1021-1022.
  • [2] DNN-Based Speech Synthesis Using Speaker Codes. Hojo, Nobukatsu; Ijima, Yusuke; Mizuno, Hideyuki. IEICE Transactions on Information and Systems, vol. E101-D, no. 2, pp. 462-472, 2018.
  • [3] DNN-Based Arabic Speech Synthesis. Amrouche, Aissa; Bentrcia, Youssouf; Boubakeur, Khadidja Nesrine; Abed, Ahcene. Proc. 9th International Conference on Electrical and Electronics Engineering (ICEEE 2022), pp. 378-382, 2022.
  • [4] Low-dimensional representation of spectral envelope without deterioration for full-band speech analysis/synthesis system. Morise, Masanori; Miyashita, Genta; Ozawa, Kenji. Proc. INTERSPEECH 2017, pp. 409-413.
  • [5] An Investigation of DNN-Based Speech Synthesis Using Speaker Codes. Hojo, Nobukatsu; Ijima, Yusuke; Mizuno, Hideyuki. Proc. INTERSPEECH 2016, pp. 2278-2282.
  • [6] Local spectral attention for full-band speech enhancement. Hou, Zhongshu; Hu, Qinwen; Chen, Kai; Cao, Zhanzhong; Lu, Jing. JASA Express Letters, vol. 3, no. 11, 2023.
  • [7] DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus. Yamashita, Yuki; Koriyama, Tomoki; Saito, Yuki; Takamichi, Shinnosuke; Ijima, Yusuke; Masumura, Ryo; Saruwatari, Hiroshi. Proc. 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 6438-6443.
  • [8] Adapting and Controlling DNN-Based Speech Synthesis Using Input Codes. Luong, Hieu-Thi; Takaki, Shinji; Henter, Gustav Eje; Yamagishi, Junichi. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4905-4909.
  • [9] A DNN-based emotional speech synthesis by speaker adaptation. Yang, Hongwu; Zhang, Weizhao; Zhi, Pengpeng. Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 633-637.
  • [10] Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model. Degottex, Gilles; Stylianou, Yannis. IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2085-2095, 2013.