Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

Cited by: 3

Authors:

Pini, Stefano [1]

Cornia, Marcella [1]

Baraldi, Lorenzo [1]

Cucchiara, Rita [1]

Affiliations:

[1] Univ Modena & Reggio Emilia, Dipartimento Ingn Enzo Ferrari, Modena, Italy

Source:

IMAGE ANALYSIS AND PROCESSING (ICIAP 2017), PT II | 2017 / Volume 10485

Keywords:

Video captioning; Naming; Datasets; Deep learning;

DOI:

10.1007/978-3-319-68548-9_36

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper, we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.


Pages: 384-395

Page count: 12
