Text-Driven Video Prediction

Cited by: 0
Authors
Song, Xue [1]
Chen, Jingjing [1]
Zhu, Bin [2]
Jiang, Yu-gang [1]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch CS, Shanghai, Peoples R China
[2] Singapore Management Univ, Singapore, Singapore
Keywords
Text-driven Video Prediction; motion inference; controllable video generation
DOI
10.1145/3675171
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Current video generation models usually convert signals indicating appearance and motion received from inputs (e.g., image and text) or latent spaces (e.g., noise vectors) into consecutive frames, fulfilling a stochastic generation process for the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints for both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and text caption as inputs, this task aims to synthesize the following frames. Specifically, appearance and motion components are provided by the image and caption separately. The key to addressing the TVP task depends on fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes of frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM), producing stepwise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments are conducted on Something-Something V2 and Single Moving MNIST datasets. Experimental results demonstrate that our model achieves better results over other baselines, verifying the effectiveness of the proposed framework.
Pages: 15