Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network

Cited by: 6
Authors
Ho, Tuan Vu [1]
Akagi, Masato [1]
Affiliation
[1] Japan Adv Inst Sci & Technol, Grad Sch Adv Sci & Technol, Nomi 9231292, Japan
Source
IEEE ACCESS | 2021 / Vol. 9
Funding
Japan Society for the Promotion of Science;
Keywords
Training; Linguistics; Generative adversarial networks; Gallium nitride; Acoustics; Decoding; Task analysis; Voice conversion; cross-lingual; controllable speaker individuality; variational autoencoder; generative adversarial network;
DOI
10.1109/ACCESS.2021.3063519
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper proposes a non-parallel cross-lingual voice conversion (CLVC) model that can mimic voice while continuously controlling speaker individuality on the basis of the variational autoencoder (VAE) and star generative adversarial network (StarGAN). Most studies on CLVC have focused only on mimicking a particular speaker's voice without being able to arbitrarily modify the speaker individuality. In practice, the ability to generate speaker individuality may be more useful than just mimicking voice. Therefore, the proposed model reliably extracts the speaker embedding from different languages using a VAE. An F0 injection method is also introduced into our model to enhance the F0 modeling in the cross-lingual setting. To avoid the over-smoothing degradation problem of the conventional VAE, the adversarial training scheme of the StarGAN is adopted to improve the training-objective function of the VAE in a CLVC task. Objective and subjective measurements confirm the effectiveness of the proposed model and F0 injection method. Furthermore, speaker-similarity measurements on fictitious voices reveal a strong linear relationship between speaker individuality and interpolated speaker embedding, which indicates that speaker individuality can be controlled with our proposed model.
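The abstract's notion of an "interpolated speaker embedding" can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interpolate_speaker_embedding`, the 4-dimensional vectors, and the use of plain linear interpolation are all assumptions made here for illustration. The record does not state the embedding dimension or the exact interpolation scheme.

```python
import numpy as np

def interpolate_speaker_embedding(emb_a, emb_b, alpha):
    """Linearly blend two speaker embeddings (hypothetical helper).

    alpha = 0 returns emb_a, alpha = 1 returns emb_b.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Hypothetical embeddings for two speakers (illustrative values only).
speaker_a = np.array([0.0, 1.0, -1.0, 2.0])
speaker_b = np.array([2.0, 1.0, 1.0, 0.0])

# Five evenly spaced "fictitious" voices between speaker A and speaker B.
fictitious = [
    interpolate_speaker_embedding(speaker_a, speaker_b, a)
    for a in np.linspace(0.0, 1.0, 5)
]
```

In a model of the kind described, each blended vector would be passed to the decoder in place of a real speaker's embedding. That step is omitted here because the record gives no details of the decoder interface.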
Pages: 47503 - 47515
Number of pages: 13