Fine-grained Audible Video Description

Cited by: 6
Authors
Shen, Xuyang [2 ]
Li, Dong [1 ]
Zhou, Jinxing
Qin, Zhen [2 ]
He, Bowen [2 ]
Han, Xiaodong [2 ]
Li, Aixuan [4 ]
Dai, Yuchao [4 ]
Kong, Lingpeng [5 ]
Wang, Meng [3 ]
Qiao, Yu [1 ]
Zhong, Yiran [1 ]
Affiliations
[1] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[2] OpenNLPLab, Beijing, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Northwestern Polytech Univ, Xian, Peoples R China
[5] Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Key R&D Program of China;
Keywords
DOI
10.1109/CVPR52729.2023.01020
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. FAVD, by contrast, requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.
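The paper's EntityScore is defined in the full text, not in this abstract. As a rough illustration only of what an entity-completeness metric measures, the sketch below computes a naive recall-style coverage: the fraction of reference entities mentioned in a generated description. The function name, signature, and exact-substring matching rule are my assumptions for illustration, not the paper's actual definition (which matches entities against the visual descriptions more carefully):

```python
def entity_score(reference_entities, predicted_text):
    """Toy entity-coverage metric: fraction of reference entities that
    appear (as case-insensitive substrings) in the predicted description.

    This is a simplified stand-in for the paper's EntityScore, which
    gauges the completeness of entities in visual descriptions.
    """
    if not reference_entities:
        return 0.0
    pred = predicted_text.lower()
    hits = sum(1 for entity in reference_entities if entity.lower() in pred)
    return hits / len(reference_entities)


# A description covering every annotated entity scores 1.0;
# missing entities reduce the score proportionally.
print(entity_score(["dog", "ball"], "A dog chases a red ball."))   # 1.0
print(entity_score(["dog", "cat"], "A dog runs in the park."))     # 0.5
```

A real implementation would extract entities with an NLP pipeline and handle synonyms and paraphrases rather than relying on exact substring matches.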
Pages: 10585-10596
Page count: 12