Diff-BGM: A Diffusion Model for Video Background Music Generation

Cited by: 1
Authors
Li, Sizhe [1 ]
Qin, Yiming [1 ]
Zheng, Minghang [1 ]
Jin, Xin [2 ,3 ]
Liu, Yang [1 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[2] Beijing Elect Sci & Technol Inst, Beijing, Peoples R China
[3] Beijing Inst Gen Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/CVPR52733.2024.02582
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
When editing a video, an attractive piece of background music is indispensable. However, video background music generation faces several challenges, such as the lack of suitable training datasets and the difficulty of flexibly controlling the generation process while sequentially aligning the video and music. In this work, we first propose BGM909, a high-quality music-video dataset with detailed annotations and shot detection that provides multi-modal information about the video and music. We then present evaluation metrics that assess music quality, including music diversity and music-video alignment measured with retrieval-precision metrics. Finally, we propose the Diff-BGM framework, which automatically generates background music for a given video and uses different signals to control different aspects of the music during generation: dynamic video features control the rhythm, while semantic features control the melody and atmosphere. To align the video and music sequentially, we introduce a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
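The segment-aware cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch: each music time step is restricted to attending only to video features from its aligned segment (e.g., the shot it belongs to). This is a hypothetical NumPy illustration of the general masking technique, not the authors' actual implementation; the function name, shapes, and segment-id scheme are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segment_aware_cross_attention(music_q, video_kv, music_seg, video_seg):
    """Cross-attention where each music step only attends to video
    features from its aligned segment (illustrative sketch only).

    music_q:   (T_m, d) query features from the music stream
    video_kv:  (T_v, d) key/value features from the video stream
    music_seg: (T_m,) segment id of each music step
    video_seg: (T_v,) segment id of each video frame
    """
    d = music_q.shape[-1]
    scores = music_q @ video_kv.T / np.sqrt(d)        # (T_m, T_v) similarity
    mask = music_seg[:, None] == video_seg[None, :]   # True where segments match
    scores = np.where(mask, scores, -1e9)             # block cross-segment attention
    attn = softmax(scores, axis=-1)                   # rows sum to 1 over own segment
    return attn @ video_kv, attn
```

With this mask, a music bar in segment 0 receives zero attention weight on frames from segment 1, which is one simple way to realize the sequential video-music alignment the paper describes.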
Pages: 27338-27347
Page count: 10
Related Papers
50 items total
  • [41] A system for automatic generation of music sports-video
    Zhang, WG
    Xing, LY
    Huang, QM
    Gao, W
    2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, 2005, : 1287 - 1290
  • [42] Automated Music Video Generation Using Emotion Synchronization
    Shin, Ki-Ho
    Kim, Hye-Rin
    Lee, In-Kwon
    2016 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2016, : 2594 - 2597
  • [43] DSR-Diff: Depth map super-resolution with diffusion model
    Shi, Yuan
    Cao, Huiyun
    Xia, Bin
    Zhu, Rui
    Liao, Qingmin
    Yang, Wenming
    PATTERN RECOGNITION LETTERS, 2024, 184 : 225 - 231
  • [44] ASD-Diff: Unsupervised Anomalous Sound Detection with Masked Diffusion Model
    Fan, Xin
    Fang, Wenjie
    Hu, Ying
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 55 - 65
  • [45] Automatic Recommendation Algorithm for Video Background Music Based on Deep Learning
    Kai, Hong
    COMPLEXITY, 2021, 2021
  • [46] Automatic synthesis of background music track data by analysis of video contents
    Modegi, T
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2004, PT 3, PROCEEDINGS, 2004, 3333 : 591 - 598
  • [47] BACKGROUND MUSIC RECOMMENDATION FOR VIDEO BASED ON MULTIMODAL LATENT SEMANTIC ANALYSIS
    Kuo, Fang-Fei
    Shan, Man-Kwan
    Lee, Suh-Yin
    2013 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME 2013), 2013,
  • [48] UP-Diff: Latent Diffusion Model for Remote Sensing Urban Prediction
    Wang, Zeyu
    Hao, Zecheng
    Zhang, Yuhan
    Feng, Yuchao
    Guo, Yufei
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2025, 22
  • [49] Automatic synthesis of background music track data by analysis of video contents
    Modegi, T
    IEEE INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES 2004 (ISCIT 2004), PROCEEDINGS, VOLS 1 AND 2: SMART INFO-MEDIA SYSTEMS, 2004, : 431 - 436
  • [50] Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing
    Lin, Jen-Chun
    Wei, Wen-Li
    Yang, James
    Wang, Hsin-Min
    Liao, Hong-Yuan Mark
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 519 - 527