HARIVO: Harnessing Text-to-Image Models for Video Generation

Cited: 0
Authors
Kwon, Mingi [1 ,2 ,5 ]
Oh, Seoung Wug [2 ]
Zhou, Yang [2 ]
Liu, Difan [2 ]
Lee, Joon-Young [2 ]
Cai, Haoran [2 ]
Liu, Baqiao [2 ,3 ]
Liu, Feng [2 ,4 ]
Uh, Youngjung [1 ]
Affiliations
[1] Yonsei Univ, Seoul, South Korea
[2] Adobe, San Jose, CA 95110 USA
[3] Univ Illinois, Champaign, IL USA
[4] Portland State Univ, Portland, OR USA
[5] GivernyAI, Giverny, France
Funding
National Research Foundation, Singapore
DOI
10.1007/978-3-031-73668-1_2
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model and training only temporal layers. We advance this approach with an architecture tailored for video generation, incorporating a mapping network and frame-wise tokens, while preserving the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together ensure realistic, temporally consistent video generation despite the limited availability of public video data. Through these designs we integrate video-specific inductive biases into both the architecture and the loss functions. Built on the frozen StableDiffusion model, our method simplifies training and integrates seamlessly with off-the-shelf models such as ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO/.
Pages: 19-36
Number of pages: 18
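
For readers who want a concrete picture of the training recipe the abstract sketches (freeze the pretrained T2I backbone, train only inserted temporal layers, and add a temporal-smoothness objective), the following minimal PyTorch sketch illustrates the general AnimateDiff-style pattern. All names here (TemporalAttention, temporal_smoothness_loss, the placeholder backbone) are illustrative assumptions, not HARIVO's actual code; the paper's mapping network, frame-wise tokens, and mitigating gradient sampling are omitted.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Self-attention over the frame axis, applied independently at each
    # spatial location; a common way to insert a trainable temporal layer.
    def __init__(self, channels: int, heads: int = 1):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Zero-init the output projection so the module is an identity at
        # the start of training and the frozen T2I behavior is preserved.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed, need_weights=False)
        out = tokens + out  # residual connection
        return out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

def temporal_smoothness_loss(eps_pred: torch.Tensor) -> torch.Tensor:
    # Penalize abrupt changes between consecutive frame-wise predictions;
    # a stand-in for the paper's temporal losses, not their formulation.
    return (eps_pred[:, 1:] - eps_pred[:, :-1]).pow(2).mean()

# Placeholder for the frozen T2I backbone (HARIVO uses StableDiffusion).
t2i_backbone = nn.Conv2d(4, 4, 3, padding=1)
for p in t2i_backbone.parameters():
    p.requires_grad_(False)

temporal_layer = TemporalAttention(channels=4)
optimizer = torch.optim.AdamW(temporal_layer.parameters(), lr=1e-4)

# One toy step on random latents: (batch=2, frames=8, 4 channels, 32x32).
latents = torch.randn(2, 8, 4, 32, 32)
b, f = latents.shape[:2]
spatial = t2i_backbone(latents.reshape(b * f, 4, 32, 32)).reshape(b, f, 4, 32, 32)
eps_pred = temporal_layer(spatial)
loss = temporal_smoothness_loss(eps_pred)  # plus the usual diffusion loss
loss.backward()
optimizer.step()

The zero-initialized output projection is the key design choice in this pattern: at the first step the network reproduces the frozen T2I model exactly, so the video model inherits the image prior intact and only gradually learns motion through the temporal layers.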