Fine-grained Audible Video Description

Cited by: 6
Authors
Shen, Xuyang [2]
Li, Dong [1]
Zhou, Jinxing
Qin, Zhen [2]
He, Bowen [2]
Han, Xiaodong [2]
Li, Aixuan [4]
Dai, Yuchao [4]
Kong, Lingpeng [5]
Wang, Meng [3]
Qiao, Yu [1]
Zhong, Yiran [1]
Affiliations
[1] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[2] OpenNLPLab, Beijing, Peoples R China
[3] Hefei Univ Technol, Hefei, Peoples R China
[4] Northwestern Polytech Univ, Xian, Peoples R China
[5] Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Key Research and Development Program of China
DOI
10.1109/CVPR52729.2023.01020
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the effectiveness of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.
Pages: 10585-10596
Number of pages: 12