Factorizing Text-to-Video Generation by Explicit Image Conditioning

Cited by: 0
Authors
Girdhar, Rohit [1 ]
Singh, Mannat [1 ]
Brown, Andrew [1 ]
Duval, Quentin [1 ]
Azadi, Samaneh [1 ]
Rambhatla, Sai Saketh [1 ]
Shah, Akbar [1 ]
Yin, Xi [1 ]
Parikh, Devi [1 ]
Misra, Ishan [1 ]
Affiliations
[1] Meta, GenAI, New York, NY 10003 USA
DOI
10.1007/978-3-031-73033-7_12
CLC classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Emu Video, a text-to-video generation model that factorizes generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on both the text and the generated image. We identify the critical design decisions, adjusted noise schedules for diffusion and multi-stage training, that enable us to directly generate high-quality, high-resolution videos without requiring the deep cascade of models used in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images from a user's text prompt, where our generations are preferred 96% of the time over prior work.
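The two-step factorization the abstract describes is easy to prototype with open components. Below is a minimal sketch using Hugging Face diffusers, with SDXL standing in for the text-to-image stage and I2VGen-XL standing in for the text-plus-image-to-video stage; Emu Video's own weights are not public, so these model ids and pipeline classes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the factorized pipeline: text -> image, then
# (text, image) -> video. Model choices are assumptions (Emu Video's
# weights are not used here); any text-to-image and image-conditioned
# video diffusion model can fill the two slots.
import torch
from diffusers import AutoPipelineForText2Image, I2VGenXLPipeline
from diffusers.utils import export_to_video

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
prompt = "a red panda eating bamboo in the snow"

# Step 1: generate an image conditioned on the text.
t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=dtype
).to(device)
image = t2i(prompt=prompt).images[0]

# Step 2: generate a video conditioned on both the text and the image.
i2v = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=dtype
).to(device)
frames = i2v(prompt=prompt, image=image, num_inference_steps=50).frames[0]
export_to_video(frames, "factorized_output.mp4", fps=8)
```

Keeping the image as an explicit intermediate is what makes the image-animation use case in the abstract fall out for free: replace Step 1 with a user-supplied image and run Step 2 unchanged.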
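The "adjusted noise schedules" credited in the abstract refer, per the paper, to fixing the residual signal that standard schedules leave at the terminal diffusion step at high resolution, in the spirit of Lin et al.'s zero-terminal-SNR rescaling. A self-contained version of that standard rescaling is sketched below; the exact schedule Emu Video trains with is the paper's, not this snippet's.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the final timestep has zero SNR,
    i.e. the last latent is pure noise (Lin et al., 2024)."""
    alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()
    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt -= last                    # shift: terminal value -> 0
    alphas_bar_sqrt *= first / (first - last)  # scale: first value restored
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# Example: rescale a standard linear schedule; the terminal beta becomes 1,
# so the forward process fully destroys the signal by the last step.
betas = torch.linspace(1e-4, 0.02, 1000)
betas_zsnr = rescale_zero_terminal_snr(betas)
assert torch.isclose(betas_zsnr[-1], torch.tensor(1.0))
```

diffusers ships an equivalent transform (a rescale_zero_terminal_snr helper used by its DDIM scheduler), so in practice this is a one-flag change rather than hand-rolled code.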
Pages: 205-224
Page count: 20
Related papers
50 in total (items [21]-[30] shown)
  • [21] Tian, Kaibin; Zhao, Ruixiang; Xin, Zijie; Lan, Bangxiang; Li, Xirong. Holistic Features are almost Sufficient for Text-to-Video Retrieval. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 17138-17147
  • [22] Phillips, Fred; Sheehan, Norman T. Modeling Accounting Workplace Interactions with Text-to-Video Animation. Accounting Perspectives, 2013, 12(01): 75-87
  • [23] Zhao, Rui; Gu, Yuchao; Wu, Jay Zhangjie; Zhang, David Junhao; Liu, Jia-Wei; Wu, Weijia; Keppo, Jussi; Shou, Mike Zheng. MotionDirector: Motion Customization of Text-to-Video Diffusion Models. Computer Vision - ECCV 2024, Pt LVI, 2025, 15114: 273-290
  • [24] Mehmood, Rayeesa; Bashir, Rumaan; Giri, Kaiser J. ODD-VGAN: Optimised Dual Discriminator Video Generative Adversarial Network for Text-to-Video Generation with Heuristic Strategy. Journal of Information & Knowledge Management, 2023
  • [25] Zhou Chunguang; Yi Jia. An Investigation into the Issues Concerning the Copyright of Content Generated by Text-to-Video AI. Contemporary Social Sciences, 2024, 9(05): 95-117
  • [26] Guo, Yuwei; Yang, Ceyuan; Rao, Anyi; Agrawala, Maneesh; Lin, Dahua; Dai, Bo. SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. Computer Vision - ECCV 2024, Pt XLII, 2025, 15100: 330-348
  • [27] Daungsupawong, Hinpetch; Wiwanitkit, Viroj. Text-to-video generative artificial intelligence: sora in neurosurgery: correspondence. Neurosurgical Review, 2024, 47(01)
  • [28] Chivileva, Iya; Lynch, Philip; Ward, Tomas E.; Smeaton, Alan F. A dataset of text prompts, videos and video quality metrics from generative text-to-video AI models. Data in Brief, 2024, 54
  • [29] Waseem, Muhammad; Khan, Muhammad Usman Ghani; Khurshid, Syed Khaldoon. LCGD: Enhancing Text-to-Video Generation via Contextual LLM Guidance and U-Net Denoising. IEEE Access, 2025, 13: 47068-47085
  • [30] Hu, Yaosi; Luo, Chong; Chen, Zhenzhong. A Benchmark for Controllable Text-Image-to-Video Generation. IEEE Transactions on Multimedia, 2024, 26: 1706-1719