ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation

Cited by: 1
Authors
Liu, Jiawei [1 ,2 ]
Wang, Weining [2 ]
Liu, Wei [3 ]
He, Qian [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing, Peoples R China
[3] ByteDance Inc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/IJCNN54540.2023.10191565
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
Diffusion models have achieved remarkable performance on image generation. However, it is difficult to reproduce this success in video generation because of the expensive training cost. In fact, pretrained image generation models have already acquired visual generation capabilities and can be utilized for video generation. Thus, we propose an Efficient training framework for Diffusion-based Text-to-Video generation (ED-T2V), which is built on a pretrained text-to-image generation model. To model temporal dynamic information, we propose temporal transformer blocks with novel identity attention and temporal cross-attention. ED-T2V has the following advantages: 1) most of the parameters of the pretrained model are frozen to inherit its generation capabilities and reduce the training cost; 2) the identity attention requires the currently generated frame to attend to all positions of its previous frame, thus providing an efficient way to keep the main content consistent across frames while enabling movement generation; 3) the temporal cross-attention constructs associations between textual descriptions and multiple video tokens along the time dimension, which models video movement better than traditional cross-attention. With the aforementioned benefits, ED-T2V not only significantly reduces the training cost of video diffusion models, but also achieves excellent generation fidelity and controllability.
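The two attention mechanisms described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration of the idea only, assuming a (batch, frames, tokens, dim) latent layout; the module names, shapes, and wiring are hypothetical and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn


class IdentityAttention(nn.Module):
    """Each frame's tokens attend to all token positions of its previous
    frame (the first frame attends to itself), one way to keep content
    consistent across frames while still allowing motion."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        # Shift frames by one so frame t uses frame t-1 as keys/values.
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        q = x.reshape(b * f, n, d)
        kv = prev.reshape(b * f, n, d)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, f, n, d)


class TemporalCrossAttention(nn.Module):
    """Cross-attention between text and video tokens along the time axis:
    for each spatial position, its sequence of tokens across frames
    queries the text embedding."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim), text: (batch, text_len, text_dim)
        b, f, n, d = x.shape
        ctx = self.text_proj(text)                        # (b, L, d)
        # Group tokens of the same spatial position across all frames.
        q = x.permute(0, 2, 1, 3).reshape(b * n, f, d)    # (b*n, f, d)
        ctx = ctx.repeat_interleave(n, dim=0)             # (b*n, L, d)
        out, _ = self.attn(q, ctx, ctx)
        return out.reshape(b, n, f, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    latents = torch.randn(2, 8, 64, 320)   # (batch, frames, tokens, dim)
    text_emb = torch.randn(2, 77, 768)     # e.g. CLIP-like text embeddings
    out = TemporalCrossAttention(320, 768)(IdentityAttention(320)(latents),
                                           text_emb)
    print(out.shape)                        # torch.Size([2, 8, 64, 320])
```

In a training setup matching the paper's description, the pretrained text-to-image weights would be kept frozen (e.g. via `p.requires_grad_(False)`) and only temporal modules such as these would be optimized.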
Pages: 8