Fine-grained Audible Video Description

Cited by: 6
Authors
Shen, Xuyang [2 ]
Li, Dong [1 ]
Zhou, Jinxing
Qin, Zhen [2 ]
He, Bowen [2 ]
Han, Xiaodong [2 ]
Li, Aixuan [4 ]
Dai, Yuchao [4 ]
Kong, Lingpeng [5 ]
Wang, Meng [3 ]
Qiao, Yu [1 ]
Zhong, Yiran [1 ]
Affiliations
[1] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[2] OpenNLPLab, Beijing, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Northwestern Polytech Univ, Xian, Peoples R China
[5] Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Key R&D Program of China;
Keywords
DOI
10.1109/CVPR52729.2023.01020
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.
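
The abstract states that the model is trained with a combination of masked language modeling and auto-regressive language modeling losses. The paper's exact formulation is not reproduced in this record; the sketch below only illustrates how such a combined objective could be weighted, assuming standard cross-entropy terms. The function name combined_lm_loss, the tensor shapes, and the alpha mixing weight are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def combined_lm_loss(mlm_logits, mlm_labels, ar_logits, ar_targets,
                         alpha=0.5, ignore_index=-100):
        """Weighted sum of a masked-LM loss and an auto-regressive LM loss.

        mlm_logits: (B, T, V) predictions; positions that were not masked
                    carry ignore_index in mlm_labels so they are skipped.
        ar_logits:  (B, T, V) next-token predictions; ar_targets is the
                    token sequence already shifted by the caller.
        alpha:      illustrative mixing weight (assumption, not from the paper).
        """
        # Masked language modeling term: predict only the masked tokens.
        mlm_loss = F.cross_entropy(
            mlm_logits.reshape(-1, mlm_logits.size(-1)),
            mlm_labels.reshape(-1),
            ignore_index=ignore_index,
        )
        # Auto-regressive term: predict each next token from the prefix.
        ar_loss = F.cross_entropy(
            ar_logits.reshape(-1, ar_logits.size(-1)),
            ar_targets.reshape(-1),
            ignore_index=ignore_index,
        )
        return alpha * mlm_loss + (1.0 - alpha) * ar_loss

In practice the two terms would come from the same transformer run with different attention masks; the equal 0.5/0.5 weighting here is purely a placeholder.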
Pages: 10585-10596
Page count: 12