HARIVO: Harnessing Text-to-Image Models for Video Generation

Cited by: 0
Authors
Kwon, Mingi [1 ,2 ,5 ]
Oh, Seoung Wug [2 ]
Zhou, Yang [2 ]
Liu, Difan [2 ]
Lee, Joon-Young [2 ]
Cai, Haoran [2 ]
Liu, Baqiao [2 ,3 ]
Liu, Feng [2 ,4 ]
Uh, Youngjung [1 ]
Affiliations
[1] Yonsei Univ, Seoul, South Korea
[2] Adobe, San Jose, CA 95110 USA
[3] Univ Illinois, Champaign, IL USA
[4] Portland State Univ, Portland, OR USA
[5] GivernyAI, Giverny, France
Funding
National Research Foundation, Singapore
DOI
10.1007/978-3-031-73668-1_2
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training and allows seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO/.
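The abstract mentions novel loss functions for temporal smoothness. As a hedged illustration only (not the paper's actual formulation, which is not given in this record), a temporal-smoothness penalty can be sketched as the mean squared difference between adjacent frames of a per-frame prediction tensor; the function name and array shapes below are assumptions for illustration:

```python
import numpy as np

def temporal_smoothness_loss(frames: np.ndarray) -> float:
    """Mean squared difference between adjacent frames.

    frames: array of shape (T, H, W, C) holding per-frame
    predictions (e.g. predicted noise for each video frame).
    A lower value means adjacent frames agree more closely.
    """
    diffs = frames[1:] - frames[:-1]  # (T-1, H, W, C) frame-to-frame deltas
    return float(np.mean(diffs ** 2))

# A static clip (all frames identical) incurs zero penalty,
# while independently sampled random frames incur a positive one.
static = np.ones((8, 4, 4, 3))
noisy = np.random.default_rng(0).normal(size=(8, 4, 4, 3))
```

In training, such a penalty would be added to the usual diffusion denoising objective, encouraging the temporal layers to produce consistent motion rather than flicker.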
Pages: 19-36 (18 pages)
Related Papers (50 items)
  • [1] Towards Consistent Video Editing with Text-to-Image Diffusion Models
    Zhang, Zicheng
    Li, Bonan
    Nie, Xuecheng
    Han, Congying
    Guo, Tiande
    Liu, Luoqi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
    Yang, Jingyuan
    Feng, Jiawei
    Huang, Hui
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 6358 - 6368
  • [3] Controllable Text-to-Image Generation
    Li, Bowen
    Qi, Xiaojuan
    Lukasiewicz, Thomas
    Torr, Philip H. S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [4] Surgical text-to-image generation
    Nwoye, Chinedu Innocent
    Bose, Rupak
    Elgohary, Kareem
    Arboit, Lorenzo
    Carlino, Giorgio
    Lavanchy, Joel L.
    Mascagni, Pietro
    Padoy, Nicolas
    PATTERN RECOGNITION LETTERS, 2025, 190 : 73 - 80
  • [5] Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
    Peng, Duo
    Zhang, Zhengbo
    Hu, Ping
    Ke, Qiuhong
    Yau, David K. Y.
    Liu, Jun
    COMPUTER VISION - ECCV 2024, PT XIII, 2025, 15071 : 342 - 360
  • [6] Prompt Stealing Attacks Against Text-to-Image Generation Models
    Shen, Xinyue
    Qu, Yiting
    Backes, Michael
    Zhang, Yang
    PROCEEDINGS OF THE 33RD USENIX SECURITY SYMPOSIUM, SECURITY 2024, 2024, : 5823 - 5840
  • [7] Expressive Text-to-Image Generation with Rich Text
    Ge, Songwei
    Park, Taesung
    Zhu, Jun-Yan
    Huang, Jia-Bin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7511 - 7522
  • [8] Video Colorization with Pre-trained Text-to-Image Diffusion Models
    Liu, Hanyuan
    Xie, Minshan
    Xing, Jinbo
    Li, Chengze
    Wong, Tien-Tsin
    arXiv, 2023,
  • [9] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
    Zhao, Shihao
    Hao, Shaozhe
    Zi, Bojia
    Xu, Huaizhe
    Wong, Kwan-Yee K.
    COMPUTER VISION - ECCV 2024, PT LXXXI, 2025, 15139 : 70 - 86
  • [10] SEMANTICALLY INVARIANT TEXT-TO-IMAGE GENERATION
    Sah, Shagan
    Peri, Dheeraj
    Shringi, Ameya
    Zhang, Chi
    Dominguez, Miguel
    Savakis, Andreas
    Ptucha, Ray
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 3783 - 3787