Diff-BGM: A Diffusion Model for Video Background Music Generation

Cited by: 1
Authors
Li, Sizhe [1 ]
Qin, Yiming [1 ]
Zheng, Minghang [1 ]
Jin, Xin [2 ,3 ]
Liu, Yang [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[2] Beijing Elect Sci & Technol Inst, Beijing, Peoples R China
[3] Beijing Inst Gen Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52733.2024.02582
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
When editing a video, a piece of attractive background music is indispensable. However, video background music generation tasks face several challenges, for example, the lack of suitable training datasets, and the difficulties in flexibly controlling the music generation process and sequentially aligning the video and music. In this work, we first propose a high-quality music-video dataset BGM909 with detailed annotation and shot detection to provide multi-modal information about the video and music. We then present evaluation metrics to assess music quality, including music diversity and alignment between music and video with retrieval precision metrics. Finally, we propose the Diff-BGM framework to automatically generate the background music for a given video, which uses different signals to control different aspects of the music during the generation process, i.e., uses dynamic video features to control music rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
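The abstract's segment-aware cross-attention, which aligns music with video shot by shot, can be illustrated with a minimal sketch: each music-token query attends only to video features whose shot-segment id matches its own. All function and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def segment_aware_cross_attention(music_q, video_kv, music_seg, video_seg):
    """Hypothetical sketch: music queries attend only to video features
    from the same shot segment (cross-segment scores are masked out)."""
    d = music_q.shape[-1]
    scores = music_q @ video_kv.T / np.sqrt(d)        # (T_music, T_video)
    mask = music_seg[:, None] != video_seg[None, :]   # True where segments differ
    scores = np.where(mask, -1e9, scores)             # block cross-segment attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ video_kv                          # (T_music, d)

# Toy example: 4 music steps, 6 video frames, two detected shots.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
kv = rng.normal(size=(6, 8))
out = segment_aware_cross_attention(
    q, kv,
    music_seg=np.array([0, 0, 1, 1]),
    video_seg=np.array([0, 0, 0, 1, 1, 1]),
)
print(out.shape)  # (4, 8)
```

With this masking, perturbing the video features of shot 1 leaves the outputs for shot-0 music steps unchanged, which is the sequential alignment property the paper's layer is designed to provide.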
Pages: 27338-27347
Page count: 10
Related Papers
50 records
  • [21] DIFF-FOLEY: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
    Luo, Simian
    Yan, Chuanhao
    Hu, Chenxu
    Zhao, Hang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [22] Adaptive background generation for video object segmentation
    Kim, Taekyung
    Paik, Joonki
    ADVANCES IN VISUAL COMPUTING, PT 1, 2006, 4291 : 871+
  • [23] Incorporating Background Knowledge into Video Description Generation
    Whitehead, Spencer
    Ji, Heng
    Bansal, Mohit
    Chang, Shih-Fu
    Voss, Clare
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3992 - 4001
  • [24] Photorealistic Video Generation with Diffusion Models
    Gupta, Agrim
    Hahn, Meera
    Yu, Lijun
    Sohn, Kihyuk
    Gu, Xiuye
    Li, Fei-Fei
    Essa, Irfan
    Jiang, Lu
    Lezama, Jose
    COMPUTER VISION - ECCV 2024, PT LXXIX, 2025, 15137 : 393 - 411
  • [25] Diffusion Probabilistic Modeling for Video Generation
    Yang, Ruihan
    Srivastava, Prakhar
    Mandt, Stephan
    ENTROPY, 2023, 25 (10)
  • [26] Discrete diffusion model with contrastive learning for music to natural and long dance generation
    Wang, Huaxin
    Jiang, Yujian
    Zhou, Xiangzhong
    Jiang, Wei
    npj Heritage Science, 13 (1)
  • [27] Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model
    Kang, Jaeyong
    Poria, Soujanya
    Herremans, Dorien
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 249
  • [28] Diff-ReColor: Rethinking image colorization with a generative diffusion model
    Li, Gehui
    Zhao, Shanshan
    Zhao, Tongtong
    KNOWLEDGE-BASED SYSTEMS, 2024, 300
  • [30] Phy-Diff: Physics-Guided Hourglass Diffusion Model for Diffusion MRI Synthesis
    Zhang, Juanhua
    Yan, Ruodan
    Perelli, Alessandro
    Chen, Xi
    Li, Chao
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT II, 2024, 15002 : 345 - 355