Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

Cited by: 3
Authors
Pini, Stefano [1]
Cornia, Marcella [1]
Baraldi, Lorenzo [1]
Cucchiara, Rita [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Dipartimento Ingn Enzo Ferrari, Modena, Italy
Keywords
Video captioning; Naming; Datasets; Deep learning
DOI
10.1007/978-3-319-68548-9_36
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper, we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions across 92 movies. Second, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.
Pages: 384-395
Number of pages: 12