Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network

Cited by: 6
Authors
Ho, Tuan Vu [1 ]
Akagi, Masato [1 ]
Affiliation
[1] Japan Adv Inst Sci & Technol, Grad Sch Adv Sci & Technol, Nomi 9231292, Japan
Source
IEEE ACCESS | 2021, Vol. 9
Funding
Japan Society for the Promotion of Science (JSPS);
Keywords
Training; Linguistics; Generative adversarial networks; Acoustics; Decoding; Task analysis; Voice conversion; cross-lingual; controllable speaker individuality; variational autoencoder; generative adversarial network;
DOI
10.1109/ACCESS.2021.3063519
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline code
0812 ;
Abstract
This paper proposes a non-parallel cross-lingual voice conversion (CLVC) model that can mimic voice while continuously controlling speaker individuality on the basis of the variational autoencoder (VAE) and star generative adversarial network (StarGAN). Most studies on CLVC have focused only on mimicking a particular speaker's voice, without the ability to arbitrarily modify the speaker individuality. In practice, the ability to generate speaker individuality may be more useful than merely mimicking voice. Therefore, the proposed model reliably extracts the speaker embedding from different languages using a VAE. An F0 injection method is also introduced into the model to enhance F0 modeling in the cross-lingual setting. To avoid the over-smoothing degradation problem of the conventional VAE, the adversarial training scheme of the StarGAN is adopted to improve the training-objective function of the VAE in a CLVC task. Objective and subjective measurements confirm the effectiveness of the proposed model and F0 injection method. Furthermore, speaker-similarity measurement on fictitious voices reveals a strong linear relationship between speaker individuality and interpolated speaker embedding, which indicates that speaker individuality can be controlled with the proposed model.
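The abstract's final claim — that speaker individuality varies roughly linearly with an interpolated speaker embedding — can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation; the function name `interpolate_embedding` and the toy 4-dimensional vectors are invented for illustration (real VAE speaker embeddings would be higher-dimensional and produced by the trained encoder):

```python
import numpy as np

def interpolate_embedding(emb_a: np.ndarray, emb_b: np.ndarray,
                          alpha: float) -> np.ndarray:
    """Linearly blend two speaker embeddings.

    alpha = 0.0 reproduces speaker A, alpha = 1.0 reproduces speaker B,
    and intermediate values yield a fictitious "in-between" voice.
    """
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Toy stand-ins for VAE latent speaker codes.
spk_a = np.array([1.0, 0.0, 0.5, -1.0])
spk_b = np.array([0.0, 1.0, -0.5, 1.0])

# A fictitious voice halfway between the two speakers; the decoder would
# condition on this blended embedding to synthesize the converted speech.
mid = interpolate_embedding(spk_a, spk_b, 0.5)
```

The paper's speaker-similarity results suggest that perceived identity tracks `alpha` approximately linearly, which is what makes this simple interpolation a practical control knob.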
Pages: 47503-47515
Page count: 13