Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

Cited by: 3
Authors
Pini, Stefano [1]
Cornia, Marcella [1]
Baraldi, Lorenzo [1]
Cucchiara, Rita [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Dipartimento Ingn Enzo Ferrari, Modena, Italy
Keywords
Video captioning; Naming; Datasets; Deep learning
DOI
10.1007/978-3-319-68548-9_36
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current approaches to movie description cannot refer to characters by their proper names and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.
Pages: 384-395
Page count: 12