Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

Cited by: 3
Authors
Pini, Stefano [1]
Cornia, Marcella [1]
Baraldi, Lorenzo [1]
Cucchiara, Rita [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Dipartimento Ingn Enzo Ferrari, Modena, Italy
Keywords
Video captioning; Naming; Datasets; Deep learning
DOI
10.1007/978-3-319-68548-9_36
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: firstly, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Moreover, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.
Pages: 384-395
Number of pages: 12