ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation

Cited by: 1
Authors
Liu, Jiawei [1 ,2 ]
Wang, Weining [2 ]
Liu, Wei [3 ]
He, Qian [3 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing, Peoples R China
[3] ByteDance Inc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/IJCNN54540.2023.10191565
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
Diffusion models have achieved remarkable performance on image generation. However, it is difficult to reproduce this success in video generation because of the expensive training cost. In fact, pretrained image generation models have already acquired visual generation capabilities and can be utilized for video generation. Thus, we propose an Efficient training framework for Diffusion-based Text-to-Video generation (ED-T2V), which is built on a pretrained text-to-image generation model. To model temporal dynamic information, we propose temporal transformer blocks with novel identity attention and temporal cross-attention. ED-T2V has the following advantages: 1) most of the parameters of the pretrained model are frozen to inherit its generation capabilities and reduce the training cost; 2) the identity attention requires the currently generated frame to attend to all positions of its previous frame, thus providing an efficient way to keep the main content consistent across frames while enabling movement generation; 3) the temporal cross-attention constructs associations between textual descriptions and multiple video tokens along the time dimension, which models video movement better than traditional cross-attention. With the aforementioned benefits, ED-T2V not only significantly reduces the training cost of video diffusion models, but also achieves excellent generation fidelity and controllability.
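The two attention mechanisms described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration of the idea only, assuming a (batch, frames, tokens, dim) latent layout; the module names, shapes, and wiring are hypothetical and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn


class IdentityAttention(nn.Module):
    """Each frame's tokens attend to all token positions of its previous
    frame (the first frame attends to itself), one way to keep content
    consistent across frames while still allowing motion."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        # Shift frames by one so frame t uses frame t-1 as keys/values.
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        q = x.reshape(b * f, n, d)
        kv = prev.reshape(b * f, n, d)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, f, n, d)


class TemporalCrossAttention(nn.Module):
    """Cross-attention between text and video tokens along the time axis:
    for each spatial position, its sequence of tokens across frames
    queries the text embedding."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim), text: (batch, text_len, text_dim)
        b, f, n, d = x.shape
        ctx = self.text_proj(text)                        # (b, L, d)
        # Group tokens of the same spatial position across all frames.
        q = x.permute(0, 2, 1, 3).reshape(b * n, f, d)    # (b*n, f, d)
        ctx = ctx.repeat_interleave(n, dim=0)             # (b*n, L, d)
        out, _ = self.attn(q, ctx, ctx)
        return out.reshape(b, n, f, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    latents = torch.randn(2, 8, 64, 320)   # (batch, frames, tokens, dim)
    text_emb = torch.randn(2, 77, 768)     # e.g. CLIP-like text embeddings
    out = TemporalCrossAttention(320, 768)(IdentityAttention(320)(latents),
                                           text_emb)
    print(out.shape)                        # torch.Size([2, 8, 64, 320])
```

In a training setup matching the paper's description, the pretrained text-to-image weights would be kept frozen (e.g. via `p.requires_grad_(False)`) and only temporal modules such as these would be optimized.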
Pages: 8