Fine-grained Audible Video Description

Cited by: 6

Authors:

Shen, Xuyang [2]

Li, Dong [1]

Zhou, Jinxing

Qin, Zhen [2]

He, Bowen [2]

Han, Xiaodong [2]

Li, Aixuan [4]

Dai, Yuchao [4]

Kong, Lingpeng [5]

Wang, Meng [3]

Qiao, Yu [1]

Zhong, Yiran [1]

Affiliations:

[1] Shanghai Artificial Intelligence Laboratory, Shanghai, People's Republic of China

[2] OpenNLPLab, Beijing, People's Republic of China

[3] Hefei University of Technology, Hefei, People's Republic of China

[4] Northwestern Polytechnical University, Xi'an, People's Republic of China

[5] University of Hong Kong, Hong Kong, People's Republic of China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Funding:

National Key Research and Development Program of China

Keywords:

DOI:

10.1109/CVPR52729.2023.01020

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.

Citation

Pages: 10585-10596

Page count: 12
