DISCOURSE-LEVEL PROSODY MODELING WITH A VARIATIONAL AUTOENCODER FOR NON-AUTOREGRESSIVE EXPRESSIVE SPEECH SYNTHESIS

Cited by: 2
Authors
Wu, Ning-Qian [1]
Liu, Zhao-Ci [1]
Ling, Zhen-Hua [1]
Affiliation
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
speech synthesis; prosody modeling; FastSpeech; discourse-level modeling; variational autoencoder;
DOI
10.1109/ICASSP43922.2022.9746238
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
To address the one-to-many mapping from phoneme sequences to acoustic features in expressive speech synthesis, this paper proposes a method of discourse-level prosody modeling with a variational autoencoder (VAE), built on the non-autoregressive architecture of FastSpeech. In this method, phone-level prosody codes are extracted from prosody features by combining a VAE with FastSpeech, and are predicted from discourse-level text features together with BERT embeddings. The continuous wavelet transform (CWT) that FastSpeech2 uses for F0 representation is no longer necessary. Experimental results on a Chinese audiobook dataset show that the proposed method effectively exploits discourse-level linguistic information and outperforms FastSpeech2 in the naturalness and expressiveness of synthetic speech.
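As a rough illustration of the VAE step mentioned in the abstract, the sketch below samples one latent prosody code per phone with the reparameterization trick and computes the matching KL regularization term. It is not the authors' implementation: the abstract does not state the prior, the code dimension, or the network structure, so the standard normal prior, the 2-dimensional codes, the posterior values, and the function names `sample_code` and `kl_to_standard_normal` are all assumptions made for this example.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one phone-level code.

    The standard normal prior is an assumption; the abstract does not name one.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_code(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Hypothetical posterior parameters (mean, log-variance) for three phones,
# each with a 2-dimensional prosody code. In a real system an encoder network
# would produce these from phone-aligned prosody features.
posterior = [
    ([0.0, 0.0], [0.0, 0.0]),
    ([0.5, -0.2], [-1.0, -0.5]),
    ([1.0, 1.0], [0.0, 0.0]),
]

rng = random.Random(0)  # fixed seed so the sampled codes are reproducible
codes = [sample_code(mu, lv, rng) for mu, lv in posterior]
kl_terms = [kl_to_standard_normal(mu, lv) for mu, lv in posterior]
```

In training, a KL term of this kind would typically be added to the reconstruction loss; at synthesis time the codes would come from the text-side predictor rather than from sampling the posterior.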
Pages: 7592-7596
Number of pages: 5