Text2Performer: Text-Driven Human Video Generation

Cited by: 0
Authors
Jiang, Yuming [1 ]
Yang, Shuai [1 ]
Koh, Tong Liang [1 ]
Wu, Wayne [2 ]
Loy, Chen Change [1 ]
Liu, Ziwei [1 ]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore, Singapore
[2] Shanghai AI Lab, Shanghai, Peoples R China
DOI
10.1109/ICCV51070.2023.02079
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-driven content creation has evolved into a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy is designed to mask the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512 x 256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html
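To make the abstract's "mask the pose embeddings spatio-temporally" idea concrete, the following is a minimal illustrative sketch of masking a contiguous temporal span of pose embeddings so that a model must infill the missing motion from the surrounding frames. All names, shapes, and the masking policy here are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, each frame's pose summarized by a D-dim embedding.
T, D = 16, 8
pose_seq = rng.standard_normal((T, D))

def motion_aware_mask(seq, span=4, rng=rng):
    """Zero out a contiguous temporal span of pose embeddings.

    A model trained to reconstruct the masked span from the unmasked
    frames is pushed toward temporally coherent motion. This is an
    illustrative stand-in for the paper's masking strategy.
    """
    n_frames = seq.shape[0]
    start = rng.integers(0, n_frames - span + 1)  # random span start
    masked = seq.copy()
    masked[start:start + span] = 0.0              # hide these frames
    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True               # True = masked frame
    return masked, mask

masked_seq, mask = motion_aware_mask(pose_seq)
```

A training loop would feed `masked_seq` to the motion sampler and supervise its output on the frames where `mask` is True; the unmasked frames anchor appearance and motion context.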
Pages: 22690-22700
Page count: 11