F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

Cited by: 1
Authors
Janyoi, Pongsathon [1]
Seresangtakul, Pusadee [2]
Affiliations
[1] Khon Kaen Univ, Dept Comp Sci, Nat Language & Speech Proc Lab, Khon Kaen, Thailand
[2] Khon Kaen Univ, Dept Comp Sci, Fac Sci, Khon Kaen, Thailand
Keywords
Fundamental frequency; speech synthesis; deep neural networks; HMM; generation
DOI
10.34028/iajit/17/6/9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The generation of the fundamental frequency (F0) plays an important role in speech synthesis because it directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This method is insufficient to represent F0 contours in larger units, especially the tone contours of syllables in tonal languages, which deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables, using syllable-level F0 parameters that comprise sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to represent the relationships between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined using an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms the baseline systems based on hidden Markov models (HMMs) and DNNs that predict F0 values at the frame level.
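The abstract does not say how the syllable-level F0 parameters are computed. The sketch below is only one plausible reading, not the authors' method: it resamples a syllable's frame-level F0 contour to a fixed number of points by linear interpolation and appends first-order differences as the dynamic features. The function name `syllable_f0_parameters`, the default of five points, and the use of `numpy.gradient` for the deltas are all assumptions made for illustration.

```python
import numpy as np


def syllable_f0_parameters(frame_f0, num_points=5):
    """Hypothetical helper: fixed-length F0 vector for one syllable.

    frame_f0   : frame-level F0 values (e.g. Hz or log-Hz) of one voiced syllable.
    num_points : number of equally spaced sample points (an assumed value).
    Returns the sampled F0 points followed by their first-order deltas.
    """
    frame_f0 = np.asarray(frame_f0, dtype=float)
    if frame_f0.size < 2:
        raise ValueError("need at least two F0 frames to interpolate")

    # Normalise syllable duration to [0, 1] so every syllable yields
    # the same number of points regardless of its length in frames.
    src = np.linspace(0.0, 1.0, frame_f0.size)
    dst = np.linspace(0.0, 1.0, num_points)
    sampled = np.interp(dst, src, frame_f0)

    # Assumed dynamic feature: finite-difference slope between sampled points.
    delta = np.gradient(sampled)

    return np.concatenate([sampled, delta])


if __name__ == "__main__":
    # A synthetic rising contour over 21 frames.
    contour = np.linspace(100.0, 200.0, 21)
    print(syllable_f0_parameters(contour, num_points=5))
```

Under this reading, a vector of this kind would serve as the regression target that a DNN predicts from syllable-level contextual features, in place of one F0 value per frame.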
Pages: 906-915
Page count: 10
Related papers
50 in total
  • [21] Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0
    Corkey, Niamh
    O'Mahony, Johannah
    King, Simon
    INTERSPEECH 2023, 2023, : 2014 - 2015
  • [22] Soft context clustering for F0 modeling in HMM-based speech synthesis
    Khorram, Soheil
    Sameti, Hossein
    King, Simon
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2015,
  • [23] Soft context clustering for F0 modeling in HMM-based speech synthesis
    Soheil Khorram
    Hossein Sameti
    Simon King
    EURASIP Journal on Advances in Signal Processing, 2015
  • [24] MULTI-LAYER F0 MODELING FOR HMM-BASED SPEECH SYNTHESIS
    Wang, Cheng-Cheng
    Ling, Zhen-Hua
    Zhang, Bu-Fan
    Dai, Li-Rong
    2008 6TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, 2008, : 129 - 132
  • [25] Enhanced F0 generation for GPR-based speech synthesis considering syllable-based prosodic features
    Moungsri, Decha
    Koriyama, Tomoki
    Kobayashi, Takao
    2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1575 - 1578
  • [26] A Method for Automatically Estimating F0 Model Parameters and A Speech Re-Synthesis Tool Using F0 Model and STRAIGHT
    Sato, Shota
    Kimura, Taro
    Horiuchi, Yasuo
    Nishida, Masafumi
    Kuroiwa, Shingo
    Ichikawa, Akira
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 545 - +
  • [27] CROSS-STREAM DEPENDENCY MODELING USING CONTINUOUS F0 MODEL FOR HMM-BASED SPEECH SYNTHESIS
    Wang, Xin
    Ling, Zhen-Hua
    Dai, Li-Rong
    2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 84 - 87
  • [28] Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    COMPUTER SPEECH AND LANGUAGE, 2013, 27 (05): : 1105 - 1126
  • [29] STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS
    Zen, Heiga
    Senior, Andrew
    Schuster, Mike
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7962 - 7966
  • [30] A Superpositional Model Applied to F0 Parameterization using DCT for Text-to-Speech Synthesis
    Stan, Adriana
    Giurgiu, Mircea
    2011 6TH CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2011,