A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

Authors
Ribeiro, Manuel Sam [1 ]
Yamagishi, Junichi [1 ,2 ]
Clark, Robert A. J. [1 ]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9YL, Midlothian, Scotland
[2] Natl Inst Informat, Tokyo, Japan
Keywords
speech synthesis; prosody; f0; modeling; continuous wavelet transform; perceptual experiments;
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
The Continuous Wavelet Transform (CWT) has recently been proposed to model f0 in the context of speech synthesis. It was shown that systems using signal decomposition with the CWT tend to outperform systems that model the signal directly. The f0 signal is typically decomposed into various scales of differing frequency. In these experiments, we reconstruct f0 with selected frequencies and ask native listeners to judge the naturalness of synthesized utterances with respect to natural speech. Results indicate that HMM-generated f0 is comparable to the CWT low frequencies, suggesting it mostly generates utterances with neutral intonation. Middle frequencies achieve very high levels of naturalness, while very high frequencies are mostly noise.
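The decomposition and selective reconstruction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Mexican hat wavelet, the scale values, the function names `cwt_decompose` and `reconstruct`, and the unweighted sum used for reconstruction are all assumptions, and the input is a synthetic contour rather than real f0.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at positions t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_decompose(signal, scales):
    """Return one component per scale (rows), each as long as the signal.

    Each component is the signal convolved with the wavelet stretched
    to the given scale (scale is measured in samples; an assumption).
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    components = np.zeros((len(scales), n))
    for i, s in enumerate(scales):
        # Truncate the kernel so it is never longer than the signal;
        # otherwise mode="same" would return the kernel's length.
        half = min(int(5 * s), (n - 1) // 2)
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        components[i] = np.convolve(signal, kernel, mode="same")
    return components


def reconstruct(components, selected):
    """Approximate reconstruction: unweighted sum of the selected scales.

    A real system would apply scale-dependent weights and restore the
    mean; this sketch omits both (assumption).
    """
    return components[list(selected)].sum(axis=0)


# Synthetic stand-in for an interpolated, mean-normalized f0 contour:
# a slow phrase-like movement plus a fast, small fluctuation.
frames = np.arange(400)
contour = np.sin(2 * np.pi * frames / 200) + 0.3 * np.sin(2 * np.pi * frames / 8)

scales = [2, 8, 32]  # hypothetical scales: small = high frequency
components = cwt_decompose(contour, scales)

# Keep only the middle and low frequency scales, dropping the finest one.
smoothed = reconstruct(components, [1, 2])
```

Choosing which rows to pass to `reconstruct` mirrors the idea of resynthesizing f0 from selected frequency bands; the specific bands used in the paper's listening test are not stated here, so the selection above is only an example.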
Pages: 1586-1590
Page count: 5
Related papers
50 in total
  • [1] F0 generation in a text-to-speech system using a database of natural F0 patterns
    da Silva, CH
    Nagle, EJ
    Runstein, F
    Violaro, F
    ITS '98 PROCEEDINGS - SBT/IEEE INTERNATIONAL TELECOMMUNICATIONS SYMPOSIUM, VOLS 1 AND 2, 1998, : 213 - 218
  • [2] WAVELET-BASED DECOMPOSITION OF F0 AS A SECONDARY TASK FOR DNN-BASED SPEECH SYNTHESIS WITH MULTI-TASK LEARNING
    Ribeiro, Manuel Sam
    Watts, Oliver
    Yamagishi, Junichi
    Clark, Robert A. J.
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5525 - 5529
  • [3] NEURAL-NETWORK-BASED F0 TEXT-TO-SPEECH SYNTHESIZER FOR MANDARIN
    HWANG, SH
    CHEN, SH
    IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, 1994, 141 (06): : 384 - 390
  • [4] A Superpositional Model Applied to F0 Parameterization using DCT for Text-to-Speech Synthesis
    Stan, Adriana
    Giurgiu, Mircea
    2011 6TH CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2011,
  • [5] Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    Zainko, Csaba
    Nemeth, Geza
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 1150 - 1154
  • [6] Quality Analysis of Macroprosodic F0 Dynamics in Text-to-Speech Signals
    Norrenbrock, Christoph R.
    Hinterleitner, Florian
    Heute, Ulrich
    Moeller, Sebastian
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 454 - 457
  • [7] An RNN-based Quantized F0 Model with Multi-tier Feedback Links for Text-to-Speech Synthesis
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1059 - 1063
  • [8] Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis
    Lei, Ming
    Wu, Yi-Jian
    Ling, Zhen-Hua
    Dai, Li-Rong
    2010 IEEE 10TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS (ICSP2010), VOLS I-III, 2010, : 613 - +
  • [9] Wavelet analysis used in text-to-speech synthesis
    Kobayashi, M
    Sakamoto, M
    Saito, T
    Hashimoto, Y
    Nishimura, M
    Suzuki, K
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-ANALOG AND DIGITAL SIGNAL PROCESSING, 1998, 45 (08): : 1125 - 1129
  • [10] PERCEPTUAL CLUSTERING BASED UNIT SELECTION OPTIMIZATION FOR CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS
    Jiang, Tao
    Wu, Zhiyong
    Jia, Jia
    Cai, Lianhong
    2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 64 - 68