LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Times cited: 2
Authors
Melechovsky, Jan [1]
Mehrish, Ambuj [1]
Herremans, Dorien [1]
Sisman, Berrak [2]
Affiliations
[1] Singapore Univ Technol & Design, Singapore, Singapore
[2] Univ Texas Dallas, Richardson, TX, USA
Keywords
Accent; Text-to-Speech; Multi-level Variational Autoencoder; Disentanglement; Controllable speech synthesis; Conversion
DOI
10.1109/SLT54892.2023.10023072
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Accent is a crucial aspect of speech that helps define one's identity. State-of-the-art Text-to-Speech (TTS) systems can generate high-quality voices, yet they still lack versatility and customizability, and they generally do not take accent, an important feature of speaking style, into account. In this work, we use the Multi-level VAE (ML-VAE) to build a control mechanism that disentangles accent from a reference accented speaker and synthesizes voices in different accents such as English, American, Irish, and Scottish. The proposed framework also achieves high-quality accented voice generation in a multi-speaker setup. We evaluate performance with objective metrics and conduct listening experiments for subjective assessment, and show that the proposed method performs well in terms of naturalness, speaker similarity, and accent similarity.
Pages: 928-935
Page count: 8
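As a rough illustration of the multi-level VAE idea described in the abstract, the sketch below (not the authors' implementation; all module names, latent sizes, and the toy feature dimension are assumptions) pairs a per-utterance instance latent with a group-level accent latent whose posterior is accumulated over all utterances sharing an accent, then decodes from both.

# Minimal PyTorch sketch of a multi-level VAE with instance and group (accent) latents.
# Illustrative only: sizes and architecture are assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, Z_INST, Z_GROUP = 80, 16, 8  # e.g. mel bins and latent sizes (assumed)

class Encoder(nn.Module):
    def __init__(self, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class MLVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_inst = Encoder(Z_INST)    # per-utterance (speaker/content) latent
        self.enc_group = Encoder(Z_GROUP)  # per-accent (group) latent
        self.dec = nn.Sequential(nn.Linear(Z_INST + Z_GROUP, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    @staticmethod
    def accumulate_group(mu, logvar):
        # Product-of-Gaussians accumulation over all utterances in one accent group,
        # so the group latent captures what the accent's utterances have in common.
        prec = torch.exp(-logvar)
        var_g = 1.0 / prec.sum(dim=0, keepdim=True)
        mu_g = var_g * (mu * prec).sum(dim=0, keepdim=True)
        return mu_g, torch.log(var_g)

    def forward(self, x_group):  # x_group: [N, FEAT_DIM], all from one accent group
        mu_i, lv_i = self.enc_inst(x_group)
        mu_g, lv_g = self.accumulate_group(*self.enc_group(x_group))
        z_i = self.reparam(mu_i, lv_i)
        z_g = self.reparam(mu_g, lv_g).expand(x_group.size(0), -1)
        recon = self.dec(torch.cat([z_i, z_g], dim=-1))
        kl = lambda m, lv: -0.5 * torch.sum(1 + lv - m.pow(2) - lv.exp())
        return F.mse_loss(recon, x_group, reduction="sum") + kl(mu_i, lv_i) + kl(mu_g, lv_g)

if __name__ == "__main__":
    model = MLVAE()
    batch = torch.randn(4, FEAT_DIM)  # four toy "utterance" frames from one accent
    print(model(batch).item())        # single scalar training loss

At synthesis time, the same split suggests the control mechanism: hold the instance latent of a reference speaker fixed while swapping in the group latent of a different accent.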