Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice Conversion

Cited by: 0
Authors
Um, Ji Sub [1]
Kim, Hoirin [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, Daejeon, South Korea
Source
Funding
National Research Foundation of Singapore
Keywords
zero-shot voice conversion; adaptive normalization layer; cluster-based pseudo label; auxiliary learning;
DOI
10.21437/Interspeech.2024-539
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Research on zero-shot voice conversion has increased recently. Many conventional studies use dynamic layers to perform conversion for unseen speakers. Our aim is to extend dynamic methods so that they also transmit content information. To achieve this, we propose AGRN-VC, which uses ConvNeXt V2 modules with adaptive global response normalization (AGRN) layers to convey content information. When conveying this information, it is crucial that the source speaker's information is not transmitted, so we adopt auxiliary learning with cluster-based pseudo labels. The auxiliary task, a pseudo-label classification performed on the content encoder's output, helps the encoder focus on content information while excluding speaker information. We compare the proposed model with various baseline models using subjective and objective metrics. The proposed approach achieves better converted speech quality in terms of speaker similarity and naturalness.
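The abstract names adaptive global response normalization (AGRN) without giving its formulation. As a rough illustration only, the sketch below applies global response normalization, as it is commonly described for ConvNeXt V2, to a [channels][time] feature map, and then makes the scale and shift depend on a conditioning vector. The conditioning vector `cond`, the linear projections, and every function and parameter name here are assumptions made for illustration; they are not taken from the paper and may differ from the authors' AGRN layer.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Global response normalization on a [channels][time] feature map."""
    # Per-channel global L2 norm, aggregated over the time axis.
    g = [math.sqrt(sum(v * v for v in row)) for row in x]
    # Divisive normalization by the mean norm across channels.
    mean_g = sum(g) / len(g)
    n = [gc / (mean_g + eps) for gc in g]
    # Scale and shift the calibrated features, then add the residual input.
    return [[gamma[c] * v * n[c] + beta[c] + v for v in row]
            for c, row in enumerate(x)]


def linear(w, b, cond):
    """Plain linear projection: one output per row of w."""
    return [sum(wi * ci for wi, ci in zip(row, cond)) + bi
            for row, bi in zip(w, b)]


def adaptive_grn(x, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical adaptive variant: gamma and beta are predicted from cond.

    Assumption: the paper's AGRN may compute its parameters differently.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return grn(x, gamma, beta)
```

With all projection weights and biases set to zero, `adaptive_grn` reduces to the identity mapping, which is the usual zero-initialized starting point for a GRN layer; nonzero projections let the conditioning vector modulate each channel.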
Pages: 2740-2744
Number of pages: 5