Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Cited by: 0
Authors
Akuzawa, Kei [1]
Iwasawa, Yusuke [1]
Matsuo, Yutaka [1]
Affiliations
[1] Univ Tokyo, Grad Sch Engn, Tokyo, Japan
Keywords
autoregressive model; variational autoencoder; expressive speech synthesis
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because these models lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher-quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generation process.
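As a rough illustration of the idea in the abstract, the sketch below shows how a VAE can summarize a whole utterance into one global latent vector. It is not the authors' implementation: the mean-pooling encoder, the weight matrices `w_mu` and `w_logvar`, and all dimensions are hypothetical, and the way the latent conditions VoiceLoop is not described in this record, so it is left out.

```python
import numpy as np

def encode_utterance(frames, w_mu, w_logvar):
    """Toy encoder (assumption): mean-pool frames over time, then two linear maps."""
    pooled = frames.mean(axis=0)                 # shape: (feat_dim,)
    return pooled @ w_mu, pooled @ w_logvar      # each of shape: (latent_dim,)

def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

rng = np.random.default_rng(0)
feat_dim, latent_dim = 8, 4                      # hypothetical sizes
frames = rng.standard_normal((50, feat_dim))     # stand-in for acoustic features
w_mu = 0.1 * rng.standard_normal((feat_dim, latent_dim))
w_logvar = 0.1 * rng.standard_normal((feat_dim, latent_dim))

mu, logvar = encode_utterance(frames, w_mu, w_logvar)
z = sample_latent(mu, logvar, rng)               # one global vector per utterance
kl = kl_to_standard_normal(mu, logvar)           # regularizer added to the training loss
```

In a setup like this, `z` would be passed to the synthesizer as a conditioning input for every output frame. At synthesis time it could be drawn from the prior or taken from a reference utterance, which is one way expressiveness can be controlled without labels.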
Pages: 3067 - 3071
Page count: 5
Related papers
50 in total
  • [21] Dataset Recommendation via Variational Graph Autoencoder
    Altaf, Basmah
    Akujuobi, Uchenna
    Yu, Lu
    Zhang, Xiangliang
    [J]. 2019 19TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2019), 2019, : 11 - 20
  • [22] Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis
    Evrard, Marc
    Delalez, Samuel
    d'Alessandro, Christophe
    Rilliard, Albert
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 3370 - 3374
  • [23] A multimodal dynamical variational autoencoder for audiovisual speech representation learning
    Sadok, Samir
    Leglaive, Simon
    Girin, Laurent
    Alameda-Pineda, Xavier
    Seguier, Renaud
    [J]. NEURAL NETWORKS, 2024, 172
  • [24] UNSUPERVISED DOMAIN ADAPTATION FOR ROBUST SPEECH RECOGNITION VIA VARIATIONAL AUTOENCODER-BASED DATA AUGMENTATION
    Hsu, Wei-Ning
    Zhang, Yu
    Glass, James
    [J]. 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 16 - 23
  • [25] Predicting Head Pose from Speech with a Conditional Variational Autoencoder
    Greenwood, David
    Laycock, Stephen
    Matthews, Iain
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3991 - 3995
  • [26] Speech Source Separation Using Variational Autoencoder and Bandpass Filter
    Do, Hao Duc
    Tran, Son Thai
    Chau, Duc Thanh
    [J]. IEEE ACCESS, 2020, 8 : 156219 - 156231
  • [27] VARIATIONAL AUTOENCODER FOR SPEECH ENHANCEMENT WITH A NOISE-AWARE ENCODER
    Fang, Huajian
    Carbajal, Guillaume
    Wermter, Stefan
    Gerkmann, Timo
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 676 - 680
  • [28] EXPRESSIVE SPEECH SYNTHESIS FOR CRITICAL SITUATIONS
    Rusko, Milan
    Darjaa, Sakhia
    Trnka, Marian
    Sabo, Robert
    Ritomsky, Marian
    [J]. COMPUTING AND INFORMATICS, 2014, 33 (06) : 1312 - 1332
  • [29] Advancements in Expressive Speech Synthesis: a Review
    Alwaisi, Shaimaa
    Nemeth, Geza
    [J]. INFOCOMMUNICATIONS JOURNAL, 2024, 16 (01) : 35 - 46
  • [30] ARTICULATORY FEATURES FOR EXPRESSIVE SPEECH SYNTHESIS
    Black, Alan W.
    Bunnell, H. Timothy
    Dou, Ying
    Muthukumar, Prasanna Kumar
    Metze, Florian
    Perry, Daniel
    Polzehl, Tim
    Prahallad, Kishore
    Steidl, Stefan
    Vaughn, Callie
    [J]. 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4005 - 4008