Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network

Cited by: 6
Authors
Ho, Tuan Vu [1 ]
Akagi, Masato [1 ]
Affiliations
[1] Japan Adv Inst Sci & Technol, Grad Sch Adv Sci & Technol, Nomi 9231292, Japan
Source
IEEE ACCESS | 2021 / Vol. 9
Funding
Japan Society for the Promotion of Science;
Keywords
Training; Linguistics; Generative adversarial networks; Acoustics; Decoding; Task analysis; Voice conversion; cross-lingual; controllable speaker individuality; variational autoencoder; generative adversarial network;
DOI
10.1109/ACCESS.2021.3063519
Chinese Library Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
This paper proposes a non-parallel cross-lingual voice conversion (CLVC) model that can mimic a target voice while continuously controlling speaker individuality, built on the variational autoencoder (VAE) and the star generative adversarial network (StarGAN). Most studies on CLVC have focused only on mimicking a particular speaker's voice, without the ability to arbitrarily modify speaker individuality. In practice, the ability to generate new speaker individualities may be more useful than merely mimicking an existing voice. The proposed model therefore uses a VAE to reliably extract speaker embeddings across different languages. An F0 injection method is also introduced to enhance F0 modeling in the cross-lingual setting. To avoid the over-smoothing degradation typical of the conventional VAE, the adversarial training scheme of StarGAN is adopted to improve the VAE training objective in the CLVC task. Objective and subjective measurements confirm the effectiveness of the proposed model and the F0 injection method. Furthermore, speaker-similarity measurements on fictitious voices reveal a strong linear relationship between speaker individuality and the interpolated speaker embedding, indicating that speaker individuality can be controlled with the proposed model.
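As a concrete illustration of the controllable speaker individuality described in the abstract, the sketch below shows how a fictitious voice could be obtained by linearly interpolating two speaker embeddings before they condition the decoder. This is a minimal Python/NumPy sketch under stated assumptions, not the authors' implementation: the embedding dimension, the placeholder embeddings, and the `interpolate_speaker` helper are hypothetical, and in the actual model the embeddings would come from the VAE speaker encoder, with the result fed to the StarGAN-trained decoder together with the injected F0.

```python
import numpy as np

EMB_DIM = 64  # assumed speaker-embedding size; chosen only for illustration


def interpolate_speaker(emb_a: np.ndarray, emb_b: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly interpolate between two speaker embeddings.

    alpha = 0.0 keeps speaker A, alpha = 1.0 keeps speaker B, and
    intermediate values yield fictitious voices whose perceived
    individuality, per the abstract, varies roughly linearly with alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * emb_a + alpha * emb_b


# Hypothetical usage: in the paper, emb_a and emb_b would be produced by the
# VAE speaker encoder from utterances in possibly different languages.
emb_a = np.random.randn(EMB_DIM)  # placeholder for an encoded speaker A
emb_b = np.random.randn(EMB_DIM)  # placeholder for an encoded speaker B
fictitious_embedding = interpolate_speaker(emb_a, emb_b, alpha=0.5)
print(fictitious_embedding.shape)  # (64,) -- would condition the decoder alongside F0
```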
Pages: 47503-47515
Number of pages: 13
Related Papers
32 in total
  • [21] Multispectral Image Reconstruction From Color Images Using Enhanced Variational Autoencoder and Generative Adversarial Network
    Liu, Xu
    Gherbi, Abdelouahed
    Wei, Zhenzhou
    Li, Wubin
    Cheriet, Mohamed
    IEEE ACCESS, 2021, 9 : 1666 - 1679
  • [22] Deep Beacon: Image Storage and Broadcast over BLE Using Variational Autoencoder Generative Adversarial Network
    Shao, Chong
    Nirjon, Shahriar
    2018 14TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING IN SENSOR SYSTEMS (DCOSS), 2018, : 147 - 154
  • [23] A comprehensive review of synthetic data generation in smart farming by using variational autoencoder and generative adversarial network
    Akkem, Yaganteeswarudu
    Biswas, Saroj Kumar
    Varanasi, Aruna
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 131
  • [24] MULTI-SPEAKER AND MULTI-DOMAIN EMOTIONAL VOICE CONVERSION USING FACTORIZED HIERARCHICAL VARIATIONAL AUTOENCODER
    Elgaar, Mohamed
    Park, Jungbae
    Lee, Sang Wan
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7769 - 7773
  • [25] DISENTANGLED SPEECH REPRESENTATION LEARNING FOR ONE-SHOT CROSS-LINGUAL VOICE CONVERSION USING β-VAE
    Lu, Hui
    Wang, Disong
    Wu, Xixin
    Wu, Zhiyong
    Liu, Xunying
    Meng, Helen
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 814 - 821
  • [26] Bolstering IoT security with IoT device type Identification using optimized Variational Autoencoder Wasserstein Generative Adversarial Network
    Sankar, Jothi Shri
    Dhatchnamurthy, Saravanan
    Mary, X. Anitha
    Gupta, Keerat Kumar
    NETWORK-COMPUTATION IN NEURAL SYSTEMS, 2024, 35 (03) : 278 - 299
  • [27] STARGAN-VC: NON-PARALLEL MANY-TO-MANY VOICE CONVERSION USING STAR GENERATIVE ADVERSARIAL NETWORKS
    Kameoka, Hirokazu
    Kaneko, Takuhiro
    Tanaka, Kou
    Hojo, Nobukatsu
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 266 - 273
  • [28] High-Quality Many-to-Many Voice Conversion Using Transitive Star Generative Adversarial Networks with Adaptive Instance Normalization
    Li, Yanping
    He, Zhengtao
    Zhang, Yan
    Yang, Zhen
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2021, 30 (10)
  • [29] Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks
    Luo, Zhaojie
    Chen, Jinhui
    Takiguchi, Tetsuya
    Ariki, Yasuo
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2019, 8 : 1 - 11
  • [30] A Multi-level GMM-Based Cross-Lingual Voice Conversion Using Language-Specific Mixture Weights for Polyglot Synthesis
    Ramani, B.
    Actlin Jeeva, M. P.
    Vijayalakshmi, P.
    Nagarajan, T.
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2016, 35 : 1283 - 1311