Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Cited: 0
Authors
Le, Matthew [1 ]
Vyas, Apoorv [1 ]
Shi, Bowen [1 ]
Karrer, Brian [1 ]
Sari, Leda [1 ]
Moritz, Rashel [1 ]
Williamson, Mary [1 ]
Manohar, Vimal [1 ]
Adi, Yossi [1 ]
Mahadeokar, Jay [1 ]
Hsu, Wei-Ning [1 ]
Affiliations
[1] Meta, Fundamental AI Research (FAIR), New York, NY, USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high-fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech given audio context and text, using over 50K hours of speech that is neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono- or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rate) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
Pages: 30
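
The abstract describes Voicebox as a non-autoregressive model trained with conditional flow matching to infill masked speech given the surrounding audio context and a text transcript. The sketch below illustrates what such a training objective can look like in PyTorch; the model interface, tensor shapes, mask convention, and SIGMA_MIN value are assumptions for illustration, not the paper's actual implementation.

# A minimal sketch of a conditional flow-matching loss for masked speech
# infilling, assuming the optimal-transport probability path of Lipman et al.
# The model signature, feature shapes, and SIGMA_MIN are illustrative guesses.
import torch
import torch.nn as nn

SIGMA_MIN = 1e-5  # assumed small terminal variance of the conditional path


def flow_matching_infill_loss(model: nn.Module,
                              x1: torch.Tensor,        # clean speech features, (B, T, D)
                              text_emb: torch.Tensor,  # frame-aligned text embeddings, (B, T, E)
                              mask: torch.Tensor       # 1.0 where frames are masked (to infill), (B, T)
                              ) -> torch.Tensor:
    """MSE between predicted and target velocity fields, on masked frames only."""
    mask = mask.float()
    b = x1.size(0)

    t = torch.rand(b, 1, 1, device=x1.device)    # flow time ~ U[0, 1]
    x0 = torch.randn_like(x1)                    # noise endpoint of the path

    # Optimal-transport path: x_t = (1 - (1 - sigma_min) t) x0 + t x1
    xt = (1 - (1 - SIGMA_MIN) * t) * x0 + t * x1
    target = x1 - (1 - SIGMA_MIN) * x0           # target (conditional) velocity

    # Audio context: unmasked frames of the clean speech, zeros where masked.
    context = x1 * (1.0 - mask).unsqueeze(-1)

    # Assumed model interface: v = model(noisy_input, time, audio_context, text)
    v = model(xt, t.view(b), context, text_emb)  # (B, T, D)

    sq_err = (v - target).pow(2).sum(dim=-1)     # (B, T)
    return (sq_err * mask).sum() / mask.sum().clamp(min=1.0)

At inference time, one would start from Gaussian noise on the masked frames and integrate the learned vector field from t = 0 to t = 1 with an ODE solver, keeping the unmasked context frames fixed; classifier-free guidance and the duration model the paper also uses are omitted from this sketch.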