TA2V: Text-Audio Guided Video Generation

Citations: 0
Authors
Zhao, Minglu [1 ]
Wang, Wenmin [1 ]
Chen, Tongbao [1 ]
Zhang, Rui [2 ]
Li, Ruochen [1 ]
Affiliations
[1] Macau Univ Sci & Technol, Sch Comp Sci & Engn, Macau 999078, Peoples R China
[2] Beijing Inst Technol, Sch Mech Engn, Beijing 100811, Peoples R China
Keywords
Multimodal video generation; text-audio to video; VQ-GAN; diffusion; deep learning
DOI
10.1109/TMM.2024.3362149
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recent conditional and unconditional video generation tasks have been accomplished mainly with generative adversarial networks (GANs), diffusion models, and autoregressive models. However, in some circumstances a single conditioning modality cannot provide enough semantic information. Therefore, in this paper we propose text-audio to video (TA2V) generation, a new task for generating realistic videos from two guiding modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is harder because of the complexity of processing higher-dimensional data and the scarcity of suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) The model uses a 3D VQ-GAN to compress high-dimensional video data into a low-dimensional discrete sequence, followed by an autoregressive model that performs text-conditional generation in the latent space. A text&audio-guided diffusion model is then applied to the generated video scenes, adding semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT) dataset, and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three mutually aligned modalities (text, audio, and video). The results demonstrate that our model can create videos with satisfactory quality and semantic information.
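To make the two-stage design concrete, the sketch below outlines the pipeline the abstract describes: a 3D VQ-GAN that quantizes video into discrete tokens, a text-conditional autoregressive prior that samples those tokens in latent space, and a final text&audio-guided refinement pass standing in for the diffusion modifier. Every class name, dimension, and conditioning choice here is an illustrative assumption using untrained toy PyTorch modules; it is not the authors' TAgVM implementation.

```python
# Minimal, self-contained sketch of the TAgVM pipeline outlined in the abstract.
# All names (Toy3DVQGAN, ToyARPrior, refine_with_diffusion), sizes, and the
# conditioning scheme are hypothetical stand-ins that only show stage wiring.
import torch
import torch.nn as nn


class Toy3DVQGAN(nn.Module):
    """Stage-1 codec: compress (B,3,T,H,W) video into discrete codebook indices."""

    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.enc = nn.Conv3d(3, dim, kernel_size=4, stride=4)        # 4x downsample in T,H,W
        self.codebook = nn.Embedding(codebook_size, dim)
        self.dec = nn.ConvTranspose3d(dim, 3, kernel_size=4, stride=4)

    def encode(self, video):
        z = self.enc(video)                                          # (B,dim,t,h,w)
        B, D, t, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest code
        return idx.view(B, t * h * w), (t, h, w)

    def decode(self, idx, grid):
        t, h, w = grid
        z = self.codebook(idx).view(-1, t, h, w, self.codebook.embedding_dim)
        return self.dec(z.permute(0, 4, 1, 2, 3))                    # back to (B,3,T,H,W)


class ToyARPrior(nn.Module):
    """Stage-1 generator: text-conditional autoregressive prior over VQ tokens."""

    def __init__(self, codebook_size=512, dim=64, text_dim=32):
        super().__init__()
        self.K = codebook_size
        self.tok = nn.Embedding(codebook_size + 1, dim)              # +1 for <bos>
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    @torch.no_grad()
    def sample(self, text_emb, length):
        seq = torch.full((text_emb.shape[0], 1), self.K, dtype=torch.long)  # <bos>
        cond = self.text_proj(text_emb).unsqueeze(1)                 # text as prefix token
        for _ in range(length):
            x = torch.cat([cond, self.tok(seq)], dim=1)
            mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
            logits = self.head(self.body(x, mask=mask))[:, -1]       # next-token logits
            seq = torch.cat([seq, torch.multinomial(logits.softmax(-1), 1)], dim=1)
        return seq[:, 1:]                                            # drop <bos>


def refine_with_diffusion(video, text_emb, audio_emb, steps=4):
    """Stage-2 stand-in for the text&audio-guided diffusion modifier: a toy
    noise-then-iteratively-denoise loop showing where audio conditioning enters."""
    denoiser = nn.Conv3d(3, 3, kernel_size=3, padding=1)
    cond = text_emb.mean() + audio_emb.mean()                        # toy joint conditioning
    x = video + 0.1 * torch.randn_like(video)
    for _ in range(steps):
        x = x - 0.1 * denoiser(x) + 0.01 * cond
    return x


if __name__ == "__main__":
    vqgan, prior = Toy3DVQGAN(), ToyARPrior()
    text_emb, audio_emb = torch.randn(1, 32), torch.randn(1, 32)     # placeholder embeddings
    grid = (2, 4, 4)                                                 # latent (t,h,w)
    tokens = prior.sample(text_emb, length=grid[0] * grid[1] * grid[2])
    coarse = vqgan.decode(tokens, grid)                              # coarse text-guided video
    refined = refine_with_diffusion(coarse, text_emb, audio_emb)
    print(refined.shape)                                             # torch.Size([1, 3, 8, 16, 16])
```

The split mirrors the abstract's structure: the autoregressive stage handles coarse text-conditional layout in the compressed latent space, while the second stage injects the finer audio-and-text semantics after decoding.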
Pages: 7250 - 7264
Page count: 15