Text2Performer: Text-Driven Human Video Generation

Cited by: 1
Authors
Jiang, Yuming [1]
Yang, Shuai [1]
Koh, Tong Liang [1]
Wu, Wayne [2]
Loy, Chen Change [1]
Liu, Ziwei [1]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore, Singapore
[2] Shanghai AI Lab, Shanghai, Peoples R China
DOI
10.1109/ICCV51070.2023.02079
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by utilizing the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy is designed to mask the pose embeddings spatial-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512 × 256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html
Pages: 22690-22700
Page count: 11