HARIVO: Harnessing Text-to-Image Models for Video Generation

Cited by: 0

Authors:

Kwon, Mingi [1,2,5]

Oh, Seoung Wug [2]

Zhou, Yang [2]

Liu, Difan [2]

Lee, Joon-Young [2]

Cai, Haoran [2]

Liu, Baqiao [2,3]

Liu, Feng [2,4]

Uh, Youngjung [1]

Affiliations:

[1] Yonsei Univ, Seoul, South Korea

[2] Adobe, San Jose, CA 95110 USA

[3] Univ Illinois, Champaign, IL USA

[4] Portland State Univ, Portland, OR USA

[5] GivernyAI, Giverny, France

Source:

COMPUTER VISION - ECCV 2024, PT LIII | 2025 / Vol. 15111

Funding:

National Research Foundation, Singapore

DOI:

10.1007/978-3-031-73668-1_2

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO/

Pages: 19-36

Page count: 18
