LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Cited by: 2
Authors
Melechovsky, Jan [1 ]
Mehrish, Ambuj [1 ]
Herremans, Dorien [1 ]
Sisman, Berrak [2 ]
Affiliations
[1] Singapore Univ Technol & Design, Singapore, Singapore
[2] Univ Texas Dallas, Richardson, TX USA
Keywords
Accent; Text-to-Speech; Multi-level Variational Autoencoder; Disentanglement; Controllable speech synthesis; Conversion
DOI
10.1109/SLT54892.2023.10023072
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Accent is a crucial aspect of speech that helps define one's identity. We note that state-of-the-art Text-to-Speech (TTS) systems can generate high-quality voices but still lack versatility and customizability. Moreover, they generally do not take accent into account, even though it is an important feature of speaking style. In this work, we utilize the concept of the Multi-level VAE (ML-VAE) to build a control mechanism that disentangles accent from a reference accented speaker and synthesizes voices in different accents, such as English, American, Irish, and Scottish. The proposed framework also achieves high-quality accented voice generation in a multi-speaker setup, which we believe is remarkable. We investigate performance through objective metrics and conduct listening experiments for a subjective assessment. We show that the proposed method achieves good performance in terms of naturalness, speaker similarity, and accent similarity.
Pages: 928-935
Page count: 8
Related papers
50 records total
  • [31] A multi-level synthesis of dyslexia
    Phoenix, Chris
    UNIFYING THEMES IN COMPLEX SYSTEMS IV, 2008, : 100 - 112
  • [32] MoFAP: A Multi-level Representation for Action Recognition
    Wang, Limin
    Qiao, Yu
    Tang, Xiaoou
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2016, 119 (03) : 254 - 271
  • [34] DIF - A Framework for VLSI Multi-level Representation
    LaPotin, D. P.
    Nassif, S. R.
    Rajan, J. V.
    Bushnell, M. L.
    Nestor, J. A.
    INTEGRATION-THE VLSI JOURNAL, 1984, 2 (03) : 227 - 241
  • [35] Multi-level Exemplar-Based Duration Generation for Expressive Speech Synthesis
    Abou-Zleikha, Mohamed
    Szekely, Eva
    Cahill, Peter
    Carson-Berndsen, Julie
    PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON SPEECH PROSODY, VOLS I AND II, 2012, : 59 - 62
  • [36] Multi-Level Symbolic Regression: Function Structure Learning for Multi-Level Data
    Sen Fong, Kei
    Motani, Mehul
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [37] ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation
    Ye, Hanchen
    Hao, Cong
    Cheng, Jianyi
    Jeong, Hyunmin
    Huang, Jack
    Neuendorffer, Stephen
    Chen, Deming
    2022 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA 2022), 2022, : 741 - 755
  • [38] Multi-accent Speech Separation with One Shot Learning
    Huang, Kuan Po
    Wu, Yuan-Kuei
    Lee, Hung-yi
    1ST WORKSHOP ON META LEARNING AND ITS APPLICATIONS TO NATURAL LANGUAGE PROCESSING (METANLP 2021), 2021, : 59 - 66
  • [39] SMAK-Net: Self-Supervised Multi-level Spatial Attention Network for Knowledge Representation towards Imitation Learning
    Ramachandruni, Kartik
    Vankadari, Madhu
    Majumder, Anima
    Dutta, Samrat
    Kumar, Swagat
    2019 28TH IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION (RO-MAN), 2019,
  • [40] Towards Cooperative Caching for Vehicular Networks with Multi-level Federated Reinforcement Learning
    Zhao, Lei
    Ran, Yongyi
    Wang, Hao
    Wang, Junxia
    Luo, Jiangtao
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2021), 2021,