Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

Cited by: 2
Authors
Levkovitch, Alon [1]
Nachmani, Eliya [1, 2]
Wolf, Lior [1]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Facebook AI Res, Tel Aviv, Israel
Source
Funding
European Research Council
Keywords
DOI
10.21437/Interspeech.2022-10045
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (approximately 3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
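The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an ILVR-style update in which, after each denoising step, the low-frequency band of the model's estimate is swapped for the low-frequency band of a noised copy of the reference sample. The moving-average filter, the names `low_pass`, `combine`, and `sample`, the `denoise_step` callable, the noise schedule, and the equal-length signals are all hypothetical choices made for illustration.

```python
import random


def low_pass(signal, width=5):
    """Moving-average low-pass filter (an illustrative stand-in for the paper's filter)."""
    half = width // 2
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))  # window shrinks at the edges
    return out


def combine(model_estimate, noised_reference, width=5):
    """Keep the estimate's high frequencies; take low frequencies from the reference."""
    lp_est = low_pass(model_estimate, width)
    lp_ref = low_pass(noised_reference, width)
    return [x - a + b for x, a, b in zip(model_estimate, lp_est, lp_ref)]


def sample(denoise_step, reference, noise_levels, width=5, seed=0):
    """Steer a (hypothetical) pretrained denoiser toward `reference` at inference time.

    denoise_step(x, sigma) stands in for one reverse step of a pretrained model.
    noise_levels is an assumed decreasing schedule of noise scales.
    No parameters are trained or updated anywhere in this loop.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in reference]  # start from pure noise
    for sigma in noise_levels:
        x = denoise_step(x, sigma)
        # Noise the reference to the current level before mixing it in.
        noised_ref = [r + sigma * rng.gauss(0.0, 1.0) for r in reference]
        x = combine(x, noised_ref, width)
    return x


if __name__ == "__main__":
    reference = [0.1 * i for i in range(16)]       # toy target-speaker signal
    shrink = lambda x, sigma: [0.5 * v for v in x]  # toy denoiser, not a real model
    print(sample(shrink, reference, [1.0, 0.5, 0.0]))
```

In a real system the reference clip and the generated utterance would differ in length and the model would operate on waveforms or spectrograms, so the mixing step would need an alignment or tiling strategy that this sketch omits.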
Pages: 2983-2987
Page count: 5