Text-Driven Video Prediction

Cited: 0
Authors
Song, Xue [1 ]
Chen, Jingjing [1 ]
Zhu, Bin [2 ]
Jiang, Yu-gang [1 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch CS, Shanghai, Peoples R China
[2] Singapore Management Univ, Singapore, Singapore
Keywords
Text-driven Video Prediction; motion inference; controllable video generation
DOI
10.1145/3675171
CLC number
TP [Automation Technology; Computer Technology]
Discipline code
0812
Abstract
Current video generation models usually convert signals indicating appearance and motion, received from inputs (e.g., an image and text) or latent spaces (e.g., noise vectors), into consecutive frames, yielding a stochastic generation process owing to the uncertainty introduced by latent-code sampling. However, this generation pattern lacks deterministic constraints on both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and a text caption as inputs, the task aims to synthesize the following frames, with the appearance and motion components provided by the image and the caption, respectively. The key to addressing the TVP task lies in fully exploiting the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, the task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes across frames. To investigate the capability of text for causal inference of progressive motion information, our TVP framework contains a Text Inference Module (TIM) that produces stepwise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments on the Something-Something V2 and Single Moving MNIST datasets demonstrate that our model outperforms other baselines, verifying the effectiveness of the proposed framework.
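The abstract describes a pipeline in which a caption is unrolled into one motion embedding per future frame, with a refinement step that re-injects global motion semantics for coherence. The toy sketch below illustrates that data flow only; the encoder, the recurrent update rule, and all names here are illustrative assumptions, not the paper's actual TIM architecture.

```python
import math


def encode_caption(caption: str, dim: int = 8) -> list:
    """Toy deterministic text encoder: map each token to a dense vector
    via fixed sinusoidal features, then L2-normalize the sum.
    (Stands in for a real text encoder; purely illustrative.)"""
    vec = [0.0] * dim
    for token in caption.lower().split():
        seed = sum(ord(c) for c in token)
        for i in range(dim):
            vec[i] += math.sin(seed * (i + 1))
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def stepwise_embeddings(caption: str, num_frames: int, alpha: float = 0.5) -> list:
    """Unroll the global caption embedding into per-frame motion embeddings.

    Each step blends the previous hidden state with the global caption
    semantics (a stand-in for the TIM recurrence), then a refinement step
    averages the global embedding back in, mimicking the described
    global-semantics refinement for coherent generation."""
    g = encode_caption(caption)
    h = [0.0] * len(g)
    steps = []
    for _ in range(num_frames):
        # Recurrent update: mix previous state with global caption semantics.
        h = [math.tanh(alpha * hi + (1 - alpha) * gi) for hi, gi in zip(h, g)]
        # Refinement: re-inject global motion semantics at every step.
        refined = [(hi + gi) / 2 for hi, gi in zip(h, g)]
        steps.append(refined)
    return steps
```

In an actual TVP-style model, each per-step embedding would condition the frame generator for the corresponding future frame; here the recurrence merely shows how a single caption can yield a progressive, frame-indexed sequence of motion signals.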
Pages: 15