Zero-Shot Face-Based Voice Conversion: Bottleneck-Free Speech Disentanglement in the Real-World Scenario

Authors
Weng, Shao-En [1]
Shuai, Hong-Han [1]
Cheng, Wen-Huang [1]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Hsinchu, Taiwan
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A face often hints at a voice: a person's appearance can be strongly related to how they sound. In this work, we study how a face can be converted to a voice, i.e., face-based voice conversion. Because no clean dataset contains both faces and speech, face-based voice conversion is hard to learn and yields low-quality output due to background noise and echo. Excess redundant information in the face-to-voice mapping also leads the model to synthesize a generic speech style. Furthermore, previous work disentangles speech by adjusting a bottleneck, but it is hard to decide on the bottleneck size. We therefore propose a bottleneck-free strategy for speech disentanglement. To avoid synthesizing a generic speech style, we use framewise facial embeddings. We apply adversarial learning with a multi-scale discriminator to improve synthesis quality. In addition, a self-attention module is added to focus on content-related features in in-the-wild data. Quantitative experiments show that our method outperforms previous work.
Pages: 13718-13726
Number of pages: 9