F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

Cited by: 1

Authors:

Janyoi, Pongsathon [1]

Seresangtakul, Pusadee [2]

Affiliations:

[1] Khon Kaen Univ, Dept Comp Sci, Nat Language & Speech Proc Lab, Khon Kaen, Thailand

[2] Khon Kaen Univ, Dept Comp Sci, Fac Sci, Khon Kaen, Thailand

Source:

INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY | 2020, Vol. 17, No. 6

Keywords:

Fundamental frequency; speech synthesis; deep neural networks; HMM; GENERATION;

DOI:

10.34028/iajit/17/6/9

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The generation of the fundamental frequency (F0) plays an important role in speech synthesis because it directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This method is insufficient to represent F0 contours in larger units, especially tone contours of syllables in tonal languages that deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables, using syllable-level F0 parameters that comprise sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to represent the relationships between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined using an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms the baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
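
Illustrative sketch (not taken from the paper): the abstract describes syllable-level F0 parameters made of sampled F0 points plus dynamic features. The Python snippet below shows one plausible way to build such a vector for a single syllable. The function name `syllable_f0_params`, the choice of five sample points, linear interpolation over normalized time, and the use of `np.gradient` as the dynamic feature are all assumptions made for illustration; the abstract does not state the authors' actual settings.

```python
import numpy as np

def syllable_f0_params(f0_frames, num_points=5):
    """Resample one syllable's F0 contour to a fixed number of points
    and append a simple dynamic (delta) feature for each point.

    Hypothetical helper: num_points and the delta definition are
    assumptions, not values reported in the abstract.
    """
    f0 = np.asarray(f0_frames, dtype=float)
    if f0.size < 2:
        raise ValueError("need at least two F0 frames per syllable")
    # Normalized time axis: 0.0 at syllable onset, 1.0 at syllable offset.
    src_t = np.linspace(0.0, 1.0, f0.size)
    dst_t = np.linspace(0.0, 1.0, num_points)
    # Fixed-length static part: F0 sampled at equally spaced positions.
    sampled = np.interp(dst_t, src_t, f0)
    # Dynamic part: finite-difference slope between sampled points.
    delta = np.gradient(sampled)
    return np.concatenate([sampled, delta])

# Frame-level log-F0 values for one (made-up) rising-tone syllable.
contour = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8]
params = syllable_f0_params(contour, num_points=5)
print(params.shape)
```

Under these assumptions every syllable maps to a vector of the same length regardless of its duration, which is what would let a feedforward network predict it from syllable-level contextual features.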


Pages: 906-915

Page count: 10
