Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

Cited by: 3
Authors
Pini, Stefano [1]
Cornia, Marcella [1]
Baraldi, Lorenzo [1]
Cucchiara, Rita [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Dipartimento Ingn Enzo Ferrari, Modena, Italy
Keywords
Video captioning; Naming; Datasets; Deep learning
DOI
10.1007/978-3-319-68548-9_36
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current approaches to movie description cannot refer to characters by their proper names and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.
Pages: 384-395
Page count: 12