F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

Cited by: 1
Authors
Janyoi, Pongsathon [1 ]
Seresangtakul, Pusadee [2 ]
Affiliations
[1] Khon Kaen Univ, Dept Comp Sci, Nat Language & Speech Proc Lab, Khon Kaen, Thailand
[2] Khon Kaen Univ, Dept Comp Sci, Fac Sci, Khon Kaen, Thailand
Keywords
Fundamental frequency; speech synthesis; deep neural networks; HMM; generation
DOI
10.34028/iajit/17/6/9
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The generation of the fundamental frequency (F0) plays an important role in speech synthesis and directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This approach is insufficient to represent F0 contours over larger units, especially the tone contours of syllables in tonal languages, which deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables using syllable-level F0 parameters comprising sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to map syllable-level contextual features to the syllable-level F0 parameters. The proposed model was evaluated in an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
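The abstract describes syllable-level F0 parameters (sampled F0 points plus dynamic features) predicted by a DNN from syllable-level contextual features. The following is a minimal Python sketch of that idea, not the authors' implementation; the number of sample points, the delta computation, the context-vector layout, and the use of scikit-learn's MLPRegressor are illustrative assumptions.

# Minimal sketch (assumptions, not the paper's code): sample a fixed number of
# F0 points across each syllable, append first-order deltas as dynamic
# features, and map syllable-level context features to these parameters
# with a small feed-forward DNN.
import numpy as np
from sklearn.neural_network import MLPRegressor

N_POINTS = 10  # assumed number of sampled F0 points per syllable

def syllable_f0_params(f0_frames):
    """Sample N_POINTS equally spaced F0 values within a syllable and
    append first-order differences as dynamic features."""
    f0_frames = np.asarray(f0_frames, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(f0_frames))
    dst = np.linspace(0.0, 1.0, num=N_POINTS)
    points = np.interp(dst, src, f0_frames)        # static features
    deltas = np.diff(points, prepend=points[0])    # dynamic features
    return np.concatenate([points, deltas])        # 2 * N_POINTS values

# Toy data: two syllables with rising and falling F0 contours (Hz).
syllables_f0 = [np.linspace(110, 150, 28), np.linspace(160, 120, 35)]
targets = np.stack([syllable_f0_params(f0) for f0 in syllables_f0])

# Hypothetical syllable-level context vectors (e.g., tone, position, duration).
contexts = np.array([[1, 0, 0.2, 0.5],
                     [0, 1, 0.8, 0.5]])

# Small DNN regressor mapping context features to syllable-level F0 parameters.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
dnn.fit(contexts, targets)
predicted = dnn.predict(contexts[:1])
print(predicted.shape)  # (1, 2 * N_POINTS)

At synthesis time, the predicted sample points would be interpolated back to frame-level F0 values over the syllable's duration; the actual network architecture and feature set used in the paper differ from this toy sketch.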
Pages: 906-915
Number of pages: 10
Related Papers
50 records in total
  • [31] A MULTI-LEVEL REPRESENTATION OF F0 USING THE CONTINUOUS WAVELET TRANSFORM AND THE DISCRETE COSINE TRANSFORM
    Ribeiro, Manuel Sam
    Clark, Robert A. J.
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4909 - 4913
  • [32] HMM-BASED EXPRESSIVE SPEECH SYNTHESIS BASED ON PHRASE-LEVEL F0 CONTEXT LABELING
    Maeno, Yu
    Nose, Takashi
    Kobayashi, Takao
    Koriyama, Tomoki
    Ijima, Yusuke
    Nakajima, Hideharu
    Mizuno, Hideyuki
    Yoshioka, Osamu
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7859 - 7863
  • [33] Using Noisy Speech to Study the Robustness of a Continuous F0 Modelling Method in HMM-based Speech Synthesis
    Ogbureke, Kalu U.
    Cabral, Joao P.
    Carson-Berndsen, Julie
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON SPEECH PROSODY, VOLS I AND II, 2012, : 67 - 70
  • [34] Tone-Group F0 selection for modeling focus prominence in small-footprint speech synthesis
    Xydas, Gerasimos
    Kouroupetroglou, Georgios
    SPEECH COMMUNICATION, 2006, 48 (09) : 1057 - 1078
  • [35] F0 transformation for emotional speech synthesis using target approximation features and bidirectional associative memories
    Ling, Zhenhua
    Gao, Li
    Dai, Lirong
    Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/Journal of Tianjin University Science and Technology, 2015, 48 (08): 670 - 674
  • [36] Head motion synthesis from speech using deep neural networks
    Ding, Chuang
    Xie, Lei
    Zhu, Pengcheng
    MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (22) : 9871 - 9888
  • [37] Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    Nemeth, Geza
    SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 282 - 291
  • [39] Efficient deep neural networks for speech synthesis using bottleneck features
    Joo, Young-Sun
    Jun, Won-Suk
    Kang, Hong-Goo
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [40] Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
    Zhaojie Luo
    Jinhui Chen
    Tetsuya Takiguchi
    Yasuo Ariki
    EURASIP Journal on Audio, Speech, and Music Processing, 2017