Text2Performer: Text-Driven Human Video Generation

Cited by: 1
Authors
Jiang, Yuming [1]
Yang, Shuai [1]
Koh, Tong Liang [1]
Wu, Wayne [2]
Loy, Chen Change [1]
Liu, Ziwei [1]
Affiliations
[1] S-Lab, Nanyang Technological University, Singapore
[2] Shanghai AI Laboratory, Shanghai, China
DOI
10.1109/ICCV51070.2023.02079
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-driven content creation has evolved into a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation must maintain the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in a discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy is designed to mask the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute the Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512 × 256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html
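The two mechanisms named in the abstract, continuous diffusion over pose embeddings and motion-aware spatio-temporal masking, can be made concrete with a small sketch. The Python/PyTorch snippet below is not the authors' implementation; it is a toy illustration under assumed tensor shapes (pose embeddings as frames × tokens × channels), and every name in it (motion_aware_mask, sample_pose_sequence, the stand-in denoiser) is hypothetical.

```python
# Toy sketch only, NOT the Text2Performer codebase. Assumes pose embeddings
# of shape (T, N, D) = (frames, tokens per frame, embedding dim); all
# function and variable names here are hypothetical.
import torch


def motion_aware_mask(pose, span=4, token_ratio=0.5):
    """Mask pose embeddings spatio-temporally: zero a random subset of
    tokens over a contiguous run of frames, so a sampler trained to fill
    the gap must produce temporally coherent motion."""
    T, N, _ = pose.shape
    mask = torch.ones(T, N, 1)
    t0 = torch.randint(0, T - span + 1, (1,)).item()    # random temporal window
    tokens = torch.randperm(N)[: int(N * token_ratio)]  # random spatial tokens
    mask[t0:t0 + span, tokens] = 0.0
    return pose * mask, mask


@torch.no_grad()
def sample_pose_sequence(denoiser, text_emb, T=8, N=32, D=256, steps=50):
    """Crude continuous-denoising loop: unlike discrete VQ samplers that
    predict codebook indices, the network regresses the continuous pose
    embeddings themselves (the 'continuous VQ-diffuser' idea)."""
    x = torch.randn(T, N, D)                  # start from Gaussian noise
    for s in reversed(range(steps)):
        x0_hat = denoiser(x, s, text_emb)     # predict clean embeddings
        w = 1.0 / (s + 1)                     # toy schedule, illustration only
        x = w * x0_hat + (1.0 - w) * x        # step toward the prediction
    return x


if __name__ == "__main__":
    toy_denoiser = lambda x, s, c: 0.5 * x    # stand-in for a real network
    seq = sample_pose_sequence(toy_denoiser, text_emb=None)
    masked, m = motion_aware_mask(seq)
    print(seq.shape, masked.shape)            # torch.Size([8, 32, 256]) twice
```

In the actual system, the denoiser would be a text-conditioned network and the appearance code, factored out of the VQVAE latent space, would be held fixed across frames; the sketch only shows where masking and continuous sampling sit in the pipeline.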
Pages: 22690-22700
Page count: 11