Text-Driven Video Prediction

Cited by: 0
Authors
Song, Xue [1]
Chen, Jingjing [1]
Zhu, Bin [2]
Jiang, Yu-gang [1]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch CS, Shanghai, Peoples R China
[2] Singapore Management Univ, Singapore, Singapore
Keywords
Text-driven Video Prediction; motion inference; controllable video generation
DOI
10.1145/3675171
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Current video generation models usually convert appearance and motion signals, received from inputs (e.g., image and text) or latent spaces (e.g., noise vectors), into consecutive frames. Generation is therefore stochastic, owing to the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints on both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and a text caption as inputs, this task aims to synthesize the following frames. Specifically, the appearance and motion components are provided by the image and the caption, respectively. The key to addressing the TVP task lies in fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes of frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM), which produces stepwise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments are conducted on the Something-Something V2 and Single Moving MNIST datasets. Experimental results demonstrate that our model outperforms the baselines, verifying the effectiveness of the proposed framework.
Pages: 15