Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Times cited: 6
Authors
Kang, Xiao [1 ]
Huang, Hao [1 ,2 ]
Hu, Ying [1 ]
Huang, Zhihua [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; NEURAL-NETWORKS;
DOI
10.1016/j.dsp.2021.103110
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject classification codes
0808 ; 0809 ;
Abstract
Vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle content and speaker representations from speech by using a content encoder and a speaker encoder, which suits the VC task of making a source speaker's speech sound like that of a target speaker without changing the linguistic content. However, the converted speech is often unsatisfactory because pure content representations are difficult to disentangle from the acoustic features when the content encoder lacks linguistic supervision. To address this issue, a connectionist temporal classification (CTC) loss is introduced under the VQ-VAE framework to guide the content encoder toward pure content representations through an auxiliary network. Because the CTC loss is unaffected by the length of the content encoder's output sequence, adding linguistic supervision to the content encoder becomes much easier. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and the subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the conventional VQ-VAE method. (C) 2021 Elsevier Inc. All rights reserved.
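To illustrate the idea in the abstract, the following is a minimal PyTorch sketch of attaching an auxiliary CTC head to a VQ-VAE content encoder. All module sizes, tensor shapes, and the loss weight lambda_ctc are illustrative assumptions rather than the authors' implementation; a full CTC-VQ-VAE would also include a speaker encoder and a decoder trained with a reconstruction loss.

# Minimal sketch (assumed shapes and module names, not the authors' code):
# a VQ-VAE content encoder whose quantized output also feeds a small
# auxiliary classifier trained with CTC loss against phoneme labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    """Downsampling conv encoder: mel frames -> content embeddings."""
    def __init__(self, n_mels=80, d_model=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=4, stride=2, padding=1),
        )
    def forward(self, mel):            # mel: (B, n_mels, T)
        return self.conv(mel)          # (B, d_model, T/4)

class VectorQuantizer(nn.Module):
    """Straight-through VQ layer with codebook and commitment losses."""
    def __init__(self, n_codes=256, d_model=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_model)
        self.beta = beta
    def forward(self, z):                                   # z: (B, d_model, T')
        z_flat = z.permute(0, 2, 1).reshape(-1, z.size(1))  # (B*T', d_model)
        idx = torch.cdist(z_flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).view(z.size(0), z.size(2), -1).permute(0, 2, 1)
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                         # straight-through estimator
        return z_q, vq_loss

# Auxiliary network: maps quantized content codes to phoneme logits for CTC.
# CTC tolerates the mismatch between the T' encoder frames and the shorter
# phoneme label sequence, so no frame-level alignment is required.
n_phones = 40                                       # assumed phoneme inventory size
encoder, vq = ContentEncoder(), VectorQuantizer()
aux_head = nn.Conv1d(64, n_phones + 1, kernel_size=1)   # +1 for the CTC blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

mel = torch.randn(2, 80, 128)                       # dummy batch: (B, n_mels, T)
phones = torch.randint(1, n_phones + 1, (2, 20))    # dummy phoneme labels (0 = blank)
phone_lens = torch.tensor([20, 18])

z_q, vq_loss = vq(encoder(mel))                     # (B, 64, 32)
log_probs = aux_head(z_q).log_softmax(dim=1)        # (B, n_phones + 1, 32)
log_probs = log_probs.permute(2, 0, 1)              # CTCLoss expects (T', B, C)
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
ctc_loss = ctc(log_probs, phones, input_lens, phone_lens)

# The total objective would also include the decoder's reconstruction loss;
# lambda_ctc weights the linguistic supervision (the value here is an assumption).
lambda_ctc = 0.5
total_aux = vq_loss + lambda_ctc * ctc_loss
print(total_aux.item())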
Pages: 10