Text2Performer: Text-Driven Human Video Generation

Cited by: 0
Authors
Jiang, Yuming [1 ]
Yang, Shuai [1 ]
Koh, Tong Liang [1 ]
Wu, Wayne [2 ]
Loy, Chen Change [1 ]
Liu, Ziwei [1 ]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore, Singapore
[2] Shanghai AI Lab, Shanghai, Peoples R China
DOI
10.1109/ICCV51070.2023.02079
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-driven content creation has evolved into a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy is designed to mask the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512 x 256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html
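To make the abstract's "mask the pose embeddings spatio-temporally" idea concrete, the following is a minimal illustrative sketch of masking a contiguous temporal span of pose embeddings so that a model must infill the missing motion from the surrounding frames. All names, shapes, and the masking policy here are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, each frame's pose summarized by a D-dim embedding.
T, D = 16, 8
pose_seq = rng.standard_normal((T, D))

def motion_aware_mask(seq, span=4, rng=rng):
    """Zero out a contiguous temporal span of pose embeddings.

    A model trained to reconstruct the masked span from the unmasked
    frames is pushed toward temporally coherent motion. This is an
    illustrative stand-in for the paper's masking strategy.
    """
    n_frames = seq.shape[0]
    start = rng.integers(0, n_frames - span + 1)  # random span start
    masked = seq.copy()
    masked[start:start + span] = 0.0              # hide these frames
    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True               # True = masked frame
    return masked, mask

masked_seq, mask = motion_aware_mask(pose_seq)
```

A training loop would feed `masked_seq` to the motion sampler and supervise its output on the frames where `mask` is True; the unmasked frames anchor appearance and motion context.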
Pages: 22690-22700
Page count: 11