Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32
Authors
Turk, Oytun [1]
Schroeder, Marc [2]
Affiliations
[1] Sensory Inc, Portland, OR 97209 USA
[2] DFKI GmbH Language Technol Lab, Speech Grp, D-66123 Saarbrucken, Germany
Keywords
Expressive speech synthesis; prosody; voice conversion; voice quality transformation
DOI
10.1109/TASL.2010.2041113
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open-source, modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by the copy synthesis, contributed to better identification than the approximate models.
Pages: 965-973
Number of pages: 9