Early Embedding and Late Reranking for Video Captioning

Cited by: 48
Authors
Dong, Jianfeng [1]
Li, Xirong [2]
Lan, Weiyu [2]
Huo, Yujia [2]
Snoek, Cees G. M. [3]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
[2] Renmin Univ China, Key Lab Data Engn & Knowledge Engn, Beijing, Peoples R China
[3] Univ Amsterdam, Intelligent Syst Lab Amsterdam, Amsterdam, Netherlands
Keywords
Video captioning; MSR Video to Language Challenge; Tag embedding; Sentence reranking
DOI
10.1145/2964284.2984064
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent work on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test. Our system is ranked 4th in terms of overall performance, while achieving the best CIDEr-D score, which measures the human-likeness of generated captions.
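The abstract does not give the modules' formulas, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes that predicted video tags come with confidence scores, that early embedding can be approximated by concatenating a score-weighted average of tag vectors to the visual feature, and that late reranking can be approximated by adding a tag-match relevance term to each candidate sentence's generation log-probability. The function names `early_embedding` and `late_rerank`, the `weight` parameter, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple


def early_embedding(visual_feature: List[float],
                    tag_scores: Dict[str, float],
                    tag_vectors: Dict[str, List[float]]) -> List[float]:
    """Append a score-weighted average of tag vectors to the visual feature.

    Assumption: concatenation is used here for illustration only; the
    abstract says tag embeddings enrich the LSTM input but not how.
    """
    dim = len(next(iter(tag_vectors.values())))
    pooled = [0.0] * dim
    total = 0.0
    for tag, score in tag_scores.items():
        vec = tag_vectors.get(tag)
        if vec is None:  # skip tags without a known vector
            continue
        total += score
        for i, value in enumerate(vec):
            pooled[i] += score * value
    if total > 0:
        pooled = [p / total for p in pooled]
    return list(visual_feature) + pooled


def late_rerank(candidates: List[Tuple[str, float]],
                tag_scores: Dict[str, float],
                weight: float = 1.0) -> List[Tuple[str, float]]:
    """Re-score candidate sentences by their overlap with predicted tags.

    Assumption: relevance is the summed score of tags that occur in the
    sentence, added to the generation log-probability with a tunable weight.
    """
    rescored = []
    for sentence, log_prob in candidates:
        words = set(sentence.lower().split())
        relevance = sum(s for tag, s in tag_scores.items() if tag in words)
        rescored.append((sentence, log_prob + weight * relevance))
    # Highest combined score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration.
tag_scores = {"dog": 0.9, "ball": 0.6}
tag_vectors = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}
lstm_input = early_embedding([0.5], tag_scores, tag_vectors)
ranked = late_rerank(
    [("a man is talking", -1.0), ("a dog chases a ball", -1.5)],
    tag_scores,
)
```

In this toy run the second candidate starts with the lower log-probability but moves to the top after reranking, because both predicted tags occur in it.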
Pages: 1082-1086
Number of pages: 5