Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Cited by: 6
Authors
Kang, Xiao [1 ]
Huang, Hao [1 ,2 ]
Hu, Ying [1 ]
Huang, Zhihua [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; Neural networks;
DOI
10.1016/j.dsp.2021.103110
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle the content and speaker representations of speech using a content encoder and a speaker encoder, which suits the VC task of making a source speaker's speech sound like that of a target speaker without changing the linguistic content. However, the converted speech is often unsatisfactory because pure content representations are difficult to disentangle from the acoustic features when the content encoder lacks linguistic supervision. To address this issue, within the VQ-VAE framework, a connectionist temporal classification (CTC) loss is proposed to guide the content encoder toward pure content representations through an auxiliary network. Because the CTC loss is unaffected by the length of the content encoder's output sequence, linguistic supervision can be added to the content encoder with little difficulty. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and the subjective results show that, compared with the traditional VQ-VAE method, the proposed approach significantly improves the speech quality and speaker similarity of the converted speech. (C) 2021 Elsevier Inc. All rights reserved.
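To make the training objective concrete, the sketch below illustrates in PyTorch how a CTC loss applied through an auxiliary network could supervise a content encoder's frame-level output, as the abstract describes. The module names, dimensions, phoneme vocabulary size, and the single linear projection are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class AuxiliaryCTCHead(nn.Module):
    """Hypothetical auxiliary network: maps content codes to phoneme logits."""
    def __init__(self, code_dim: int, num_phonemes: int):
        super().__init__()
        # +1 output class for the CTC blank symbol (index 0)
        self.proj = nn.Linear(code_dim, num_phonemes + 1)

    def forward(self, content_codes: torch.Tensor) -> torch.Tensor:
        # content_codes: (batch, time, code_dim) from the content encoder
        return self.proj(content_codes)  # (batch, time, num_phonemes + 1)

def linguistic_supervision_loss(head, content_codes, phoneme_targets,
                                input_lengths, target_lengths):
    logits = head(content_codes)
    # nn.CTCLoss expects log-probabilities shaped (time, batch, classes)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    return ctc(log_probs, phoneme_targets, input_lengths, target_lengths)

# Toy usage with random tensors standing in for real encoder outputs
batch, frames, code_dim, num_phonemes, tgt_len = 4, 120, 64, 40, 30
head = AuxiliaryCTCHead(code_dim, num_phonemes)
codes = torch.randn(batch, frames, code_dim)
targets = torch.randint(1, num_phonemes + 1, (batch, tgt_len))  # 0 is blank
loss = linguistic_supervision_loss(
    head, codes, targets,
    torch.full((batch,), frames, dtype=torch.long),
    torch.full((batch,), tgt_len, dtype=torch.long),
)
loss.backward()  # gradients flow back into the content representation
```

Note that nn.CTCLoss only requires the input sequence to be at least as long as the target phoneme sequence, so this supervision works regardless of the content encoder's downsampling rate, which is the length-insensitivity the abstract highlights.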
Pages: 10
Related papers
46 records in total
  • [41] SIG-VC: A Speaker Information Guided Zero-Shot Voice Conversion System for Both Human Beings and Machines
    Zhang, Haozhe
    Cai, Zexin
    Qin, Xiaoyi
    Li, Ming
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6567-6571
  • [42] Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers
    Long, Ziang
    Zheng, Yunling
    Yu, Meng
    Xin, Jack
    2022 5th International Conference on Artificial Intelligence for Industries (AI4I), 2022: 59-63
  • [43] Revolutionizing hyperspectral image classification for limited labeled data: unifying autoencoder-enhanced GANs with convolutional neural networks and zero-shot learning
    Ranjan, Pallavi
    Kaushal, Anukriti
    Girdhar, Ashish
    Kumar, Rajeev
    Earth Science Informatics, 2025, 18 (2)
  • [44] Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
    Lei, Yi
    Yang, Shan
    Cong, Jian
    Xie, Lei
    Su, Dan
    INTERSPEECH 2022, 2022: 2563-2567
  • [45] Zero-Shot Remote Sensing Scene Classification Method Based on Local-Global Feature Fusion and Weight Mapping Loss
    Wang, Chao
    Li, Junyong
    Tanvir, Ahmed
    Yang, Jiajun
    Xie, Tao
    Ji, Liqiang
    Zhang, Tong
    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 2763-2776
  • [46] Integrating Adversarial Generative Network with Variational Autoencoders towards Cross-Modal Alignment for Zero-Shot Remote Sensing Image Scene Classification
    Ma, Suqiang
    Liu, Chun
    Li, Zheng
    Yang, Wei
    Remote Sensing, 2022, 14 (18)