F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

Cited by: 1
Authors
Janyoi, Pongsathon [1 ]
Seresangtakul, Pusadee [2 ]
Affiliations
[1] Khon Kaen Univ, Dept Comp Sci, Nat Language & Speech Proc Lab, Khon Kaen, Thailand
[2] Khon Kaen Univ, Dept Comp Sci, Fac Sci, Khon Kaen, Thailand
Keywords
Fundamental frequency; speech synthesis; deep neural networks; HMM; generation
DOI
10.34028/iajit/17/6/9
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The generation of the fundamental frequency (F0) plays an important role in speech synthesis and directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This approach is insufficient to represent F0 contours over larger units, especially the tone contours of syllables in tonal languages, which deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables using syllable-level F0 parameters comprising sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to map syllable-level contextual features to the syllable-level F0 parameters. The proposed model was evaluated in an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
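The abstract describes syllable-level F0 parameters (sampled F0 points plus dynamic features) predicted by a DNN from syllable-level contextual features. The following is a minimal Python sketch of that idea, not the authors' implementation; the number of sample points, the delta computation, the context-vector layout, and the use of scikit-learn's MLPRegressor are illustrative assumptions.

# Minimal sketch (assumptions, not the paper's code): sample a fixed number of
# F0 points across each syllable, append first-order deltas as dynamic
# features, and map syllable-level context features to these parameters
# with a small feed-forward DNN.
import numpy as np
from sklearn.neural_network import MLPRegressor

N_POINTS = 10  # assumed number of sampled F0 points per syllable

def syllable_f0_params(f0_frames):
    """Sample N_POINTS equally spaced F0 values within a syllable and
    append first-order differences as dynamic features."""
    f0_frames = np.asarray(f0_frames, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(f0_frames))
    dst = np.linspace(0.0, 1.0, num=N_POINTS)
    points = np.interp(dst, src, f0_frames)        # static features
    deltas = np.diff(points, prepend=points[0])    # dynamic features
    return np.concatenate([points, deltas])        # 2 * N_POINTS values

# Toy data: two syllables with rising and falling F0 contours (Hz).
syllables_f0 = [np.linspace(110, 150, 28), np.linspace(160, 120, 35)]
targets = np.stack([syllable_f0_params(f0) for f0 in syllables_f0])

# Hypothetical syllable-level context vectors (e.g., tone, position, duration).
contexts = np.array([[1, 0, 0.2, 0.5],
                     [0, 1, 0.8, 0.5]])

# Small DNN regressor mapping context features to syllable-level F0 parameters.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
dnn.fit(contexts, targets)
predicted = dnn.predict(contexts[:1])
print(predicted.shape)  # (1, 2 * N_POINTS)

At synthesis time, the predicted sample points would be interpolated back to frame-level F0 values over the syllable's duration; the actual network architecture and feature set used in the paper differ from this toy sketch.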
Pages: 906-915
Number of pages: 10
Related Papers
50 records in total
  • [31] A MULTI-LEVEL REPRESENTATION OF F0 USING THE CONTINUOUS WAVELET TRANSFORM AND THE DISCRETE COSINE TRANSFORM
    Ribeiro, Manuel Sam
    Clark, Robert A. J.
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4909 - 4913
  • [32] HMM-BASED EXPRESSIVE SPEECH SYNTHESIS BASED ON PHRASE-LEVEL F0 CONTEXT LABELING
    Maeno, Yu
    Nose, Takashi
    Kobayashi, Takao
    Koriyama, Tomoki
    Ijima, Yusuke
    Nakajima, Hideharu
    Mizuno, Hideyuki
    Yoshioka, Osamu
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7859 - 7863
  • [33] Using Noisy Speech to Study the Robustness of a Continuous F0 Modelling Method in HMM-based Speech Synthesis
    Ogbureke, Kalu U.
    Cabral, Joao P.
    Carson-Berndsen, Julie
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON SPEECH PROSODY, VOLS I AND II, 2012, : 67 - 70
  • [34] Tone-Group F0 selection for modeling focus prominence in small-footprint speech synthesis
    Xydas, Gerasimos
    Kouroupetroglou, Georgios
    SPEECH COMMUNICATION, 2006, 48 (09) : 1057 - 1078
  • [35] F0 transformation for emotional speech synthesis using target approximation features and bidirectional associative memories
    Ling, Zhenhua
    Gao, Li
    Dai, Lirong
    Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/Journal of Tianjin University Science and Technology, 2015, 48 (08): 670 - 674
  • [36] Head motion synthesis from speech using deep neural networks
    Ding, Chuang
    Xie, Lei
    Zhu, Pengcheng
    MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (22) : 9871 - 9888
  • [37] Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    Nemeth, Geza
    SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 282 - 291
  • [39] Efficient deep neural networks for speech synthesis using bottleneck features
    Joo, Young-Sun
    Jun, Won-Suk
    Kang, Hong-Goo
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [40] Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
    Zhaojie Luo
    Jinhui Chen
    Tetsuya Takiguchi
    Yasuo Ariki
    EURASIP Journal on Audio, Speech, and Music Processing, 2017