TA2V: Text-Audio Guided Video Generation

Citations: 0
Authors
Zhao, Minglu [1 ]
Wang, Wenmin [1 ]
Chen, Tongbao [1 ]
Zhang, Rui [2 ]
Li, Ruochen [1 ]
Affiliations
[1] Macau Univ Sci & Technol, Sch Comp Sci & Engn, Macau 999078, Peoples R China
[2] Beijing Inst Technol, Sch Mech Engn, Beijing 100811, Peoples R China
Keywords
Multimodal video generation; text-audio to video; VQ-GAN; diffusion; deep learning
DOI
10.1109/TMM.2024.3362149
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recent conditional and unconditional video generation tasks have been accomplished mainly with generative adversarial networks (GANs), diffusion models, and autoregressive models. However, in some circumstances a single conditioning modality cannot provide enough semantic information. Therefore, in this paper we propose text-audio to video (TA2V) generation, a new task for generating realistic videos from two guiding modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is harder because of the complexity of processing higher-dimensional data and the scarcity of suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) The model uses a 3D VQ-GAN to compress high-dimensional video data into a low-dimensional discrete sequence, followed by an autoregressive model that performs text-conditional generation in the latent space. A text&audio-guided diffusion model is then applied to the generated video scenes, adding semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT) dataset, and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three mutually aligned modalities (text, audio, and video). The results demonstrate that our model can create videos with satisfactory quality and semantic information.
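To make the two-stage design concrete, the sketch below outlines the pipeline the abstract describes: a 3D VQ-GAN that quantizes video into discrete tokens, a text-conditional autoregressive prior that samples those tokens in latent space, and a final text&audio-guided refinement pass standing in for the diffusion modifier. Every class name, dimension, and conditioning choice here is an illustrative assumption using untrained toy PyTorch modules; it is not the authors' TAgVM implementation.

```python
# Minimal, self-contained sketch of the TAgVM pipeline outlined in the abstract.
# All names (Toy3DVQGAN, ToyARPrior, refine_with_diffusion), sizes, and the
# conditioning scheme are hypothetical stand-ins that only show stage wiring.
import torch
import torch.nn as nn


class Toy3DVQGAN(nn.Module):
    """Stage-1 codec: compress (B,3,T,H,W) video into discrete codebook indices."""

    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.enc = nn.Conv3d(3, dim, kernel_size=4, stride=4)        # 4x downsample in T,H,W
        self.codebook = nn.Embedding(codebook_size, dim)
        self.dec = nn.ConvTranspose3d(dim, 3, kernel_size=4, stride=4)

    def encode(self, video):
        z = self.enc(video)                                          # (B,dim,t,h,w)
        B, D, t, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)  # nearest code
        return idx.view(B, t * h * w), (t, h, w)

    def decode(self, idx, grid):
        t, h, w = grid
        z = self.codebook(idx).view(-1, t, h, w, self.codebook.embedding_dim)
        return self.dec(z.permute(0, 4, 1, 2, 3))                    # back to (B,3,T,H,W)


class ToyARPrior(nn.Module):
    """Stage-1 generator: text-conditional autoregressive prior over VQ tokens."""

    def __init__(self, codebook_size=512, dim=64, text_dim=32):
        super().__init__()
        self.K = codebook_size
        self.tok = nn.Embedding(codebook_size + 1, dim)              # +1 for <bos>
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    @torch.no_grad()
    def sample(self, text_emb, length):
        seq = torch.full((text_emb.shape[0], 1), self.K, dtype=torch.long)  # <bos>
        cond = self.text_proj(text_emb).unsqueeze(1)                 # text as prefix token
        for _ in range(length):
            x = torch.cat([cond, self.tok(seq)], dim=1)
            mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
            logits = self.head(self.body(x, mask=mask))[:, -1]       # next-token logits
            seq = torch.cat([seq, torch.multinomial(logits.softmax(-1), 1)], dim=1)
        return seq[:, 1:]                                            # drop <bos>


def refine_with_diffusion(video, text_emb, audio_emb, steps=4):
    """Stage-2 stand-in for the text&audio-guided diffusion modifier: a toy
    noise-then-iteratively-denoise loop showing where audio conditioning enters."""
    denoiser = nn.Conv3d(3, 3, kernel_size=3, padding=1)
    cond = text_emb.mean() + audio_emb.mean()                        # toy joint conditioning
    x = video + 0.1 * torch.randn_like(video)
    for _ in range(steps):
        x = x - 0.1 * denoiser(x) + 0.01 * cond
    return x


if __name__ == "__main__":
    vqgan, prior = Toy3DVQGAN(), ToyARPrior()
    text_emb, audio_emb = torch.randn(1, 32), torch.randn(1, 32)     # placeholder embeddings
    grid = (2, 4, 4)                                                 # latent (t,h,w)
    tokens = prior.sample(text_emb, length=grid[0] * grid[1] * grid[2])
    coarse = vqgan.decode(tokens, grid)                              # coarse text-guided video
    refined = refine_with_diffusion(coarse, text_emb, audio_emb)
    print(refined.shape)                                             # torch.Size([1, 3, 8, 16, 16])
```

The split mirrors the abstract's structure: the autoregressive stage handles coarse text-conditional layout in the compressed latent space, while the second stage injects the finer audio-and-text semantics after decoding.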
Pages: 7250 - 7264
Page count: 15