Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited by: 0
Authors
Girdhar, Rohit [1 ]
Singh, Mannat [1 ]
Brown, Andrew [1 ]
Duval, Quentin [1 ]
Azadi, Samaneh [1 ]
Rambhatla, Sai Saketh [1 ]
Shah, Akbar [1 ]
Yin, Xi [1 ]
Parikh, Devi [1 ]
Misra, Ishan [1 ]
Affiliations
[1] Meta, GenAI, New York, NY 10003 USA
DOI
10.1007/978-3-031-73033-7_12
CLC classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Emu Video, a text-to-video generation model that factorizes generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on both the text and the generated image. We identify the critical design decisions, adjusted noise schedules for diffusion and multi-stage training, that enable us to directly generate high-quality, high-resolution videos without requiring the deep cascade of models used in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images from a user's text prompt, where our generations are preferred 96% of the time over prior work.
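The two-step factorization the abstract describes is easy to prototype with open components. Below is a minimal sketch using Hugging Face diffusers, with SDXL standing in for the text-to-image stage and I2VGen-XL standing in for the text-plus-image-to-video stage; Emu Video's own weights are not public, so these model ids and pipeline classes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the factorized pipeline: text -> image, then
# (text, image) -> video. Model choices are assumptions (Emu Video's
# weights are not used here); any text-to-image and image-conditioned
# video diffusion model can fill the two slots.
import torch
from diffusers import AutoPipelineForText2Image, I2VGenXLPipeline
from diffusers.utils import export_to_video

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
prompt = "a red panda eating bamboo in the snow"

# Step 1: generate an image conditioned on the text.
t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=dtype
).to(device)
image = t2i(prompt=prompt).images[0]

# Step 2: generate a video conditioned on both the text and the image.
i2v = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=dtype
).to(device)
frames = i2v(prompt=prompt, image=image, num_inference_steps=50).frames[0]
export_to_video(frames, "factorized_output.mp4", fps=8)
```

Keeping the image as an explicit intermediate is what makes the image-animation use case in the abstract fall out for free: replace Step 1 with a user-supplied image and run Step 2 unchanged.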
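The "adjusted noise schedules" credited in the abstract refer, per the paper, to fixing the residual signal that standard schedules leave at the terminal diffusion step at high resolution, in the spirit of Lin et al.'s zero-terminal-SNR rescaling. A self-contained version of that standard rescaling is sketched below; the exact schedule Emu Video trains with is the paper's, not this snippet's.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the final timestep has zero SNR,
    i.e. the last latent is pure noise (Lin et al., 2024)."""
    alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()
    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt -= last                    # shift: terminal value -> 0
    alphas_bar_sqrt *= first / (first - last)  # scale: first value restored
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# Example: rescale a standard linear schedule; the terminal beta becomes 1,
# so the forward process fully destroys the signal by the last step.
betas = torch.linspace(1e-4, 0.02, 1000)
betas_zsnr = rescale_zero_terminal_snr(betas)
assert torch.isclose(betas_zsnr[-1], torch.tensor(1.0))
```

diffusers ships an equivalent transform (a rescale_zero_terminal_snr helper used by its DDIM scheduler), so in practice this is a one-flag change rather than hand-rolled code.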
Pages: 205-224
Page count: 20
Related papers
50 in total (items [21]-[30] shown)
  • [21] Tian, Kaibin; Zhao, Ruixiang; Xin, Zijie; Lan, Bangxiang; Li, Xirong. Holistic Features are almost Sufficient for Text-to-Video Retrieval. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 17138-17147
  • [22] Phillips, Fred; Sheehan, Norman T. Modeling Accounting Workplace Interactions with Text-to-Video Animation. Accounting Perspectives, 2013, 12(01): 75-87
  • [23] Zhao, Rui; Gu, Yuchao; Wu, Jay Zhangjie; Zhang, David Junhao; Liu, Jia-Wei; Wu, Weijia; Keppo, Jussi; Shou, Mike Zheng. MotionDirector: Motion Customization of Text-to-Video Diffusion Models. Computer Vision - ECCV 2024, Pt LVI, 2025, 15114: 273-290
  • [24] Mehmood, Rayeesa; Bashir, Rumaan; Giri, Kaiser J. ODD-VGAN: Optimised Dual Discriminator Video Generative Adversarial Network for Text-to-Video Generation with Heuristic Strategy. Journal of Information & Knowledge Management, 2023
  • [25] Zhou Chunguang; Yi Jia. An Investigation into the Issues Concerning the Copyright of Content Generated by Text-to-Video AI. Contemporary Social Sciences, 2024, 9(05): 95-117
  • [26] Guo, Yuwei; Yang, Ceyuan; Rao, Anyi; Agrawala, Maneesh; Lin, Dahua; Dai, Bo. SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. Computer Vision - ECCV 2024, Pt XLII, 2025, 15100: 330-348
  • [27] Daungsupawong, Hinpetch; Wiwanitkit, Viroj. Text-to-video generative artificial intelligence: sora in neurosurgery: correspondence. Neurosurgical Review, 2024, 47(01)
  • [28] Chivileva, Iya; Lynch, Philip; Ward, Tomas E.; Smeaton, Alan F. A dataset of text prompts, videos and video quality metrics from generative text-to-video AI models. Data in Brief, 2024, 54
  • [29] Waseem, Muhammad; Khan, Muhammad Usman Ghani; Khurshid, Syed Khaldoon. LCGD: Enhancing Text-to-Video Generation via Contextual LLM Guidance and U-Net Denoising. IEEE Access, 2025, 13: 47068-47085
  • [30] Hu, Yaosi; Luo, Chong; Chen, Zhenzhong. A Benchmark for Controllable Text-Image-to-Video Generation. IEEE Transactions on Multimedia, 2024, 26: 1706-1719